2nd World Conference on Technology, Innovation and Entrepreneurship (WCTIE-2017), V.5,p.368-371    Zenuni, at al. 

_____________________________________________________________________________________________________ 
 DOI: 10.17261/Pressacademia.2017.612                                         368                                                          PressAcademia Procedia 

 
 2nd World Conference on Technology, Innovation and Entrepreneurship 
 May 12- 14, 2017, Istanbul, Turkey. Edited by Sefer Şener 

 
AUTOMATIC HATE SPEECH DETECTION IN ONLINE CONTENTS USING LATENT SEMANTIC ANALYSIS 
 
DOI: 10.17261/Pressacademia.2017.612 
PAP-WCTIE-V.5-2017(50)-p.368-371 
 
Xhemal Zenuni

1
, Jaumin Ajdari

2
, Florije Ismaili

3
, Bujar Raufi

4 
1
South East European University.  xh.zenuni@seeu.edu.mk    

2
South East European University.  j.ajdari@seeu.edu.mk  

3
South East European University.  f.ismaili@seeu.edu.mk  

4
South East European University. b.raufi@seeu.edu.mk  

 
ABSTRACT  
Internet in general and social media in particular have greatly facilitated the communication, interaction and collaboration among people 
and different entities. As generally there is no censorship, these media sometimes are used to proliferate discourses that contain hateful 
messages targeting ethnic origin, religious or sexual groups, which potentially may degenerate to violent acts against individuals of such 
groups. Therefore, we explore the idea of building of automatic classifier that can be used for detection of hate speech in public Albanian 
language pages. A hate speech corpus for Albanian language is created, and then based on Support Vector Machine (SVM) approach,  an 
automatic hate speech detection system is proposed. Such system can be used to detect and analyze hate speech in online contents over 
time and to enhance our knowledge on how they affect opinion creation in society.  
 

Keywords: Hate speech detection, text classification, support vector machines, NLP, Albanian language  
 
 
1. INTRODUCTION 

The continous growth of social media and other Internet services, such as Facebook, Twitter, microblogging or Web services 
among others has greately facilitated the information exchange, interaction and collaboration among people and different 
entities. However, the widespread adoption of social media and other online services offers new opportunities to 
dissiminate hateful messages. Up to date, there is very little research and evidence how the diffusion of hate speech in 
online contents could trigger hate crimes, yet this potential is recently recognised. For example, Facebook and Twitter 
pledge to remove hate speech contents within 24 hours after they are reported (Kottasova, 2016). On the other side, EU 
despite its security and political situations, launched a “code of conduct” to establish public commitments for the biggest 
Internet companies that the valid hate speech contents will be removed and yet the right to freedom of expression will be 
preserved (Commision, 2016). 

In this context, automatic detection of abusive and hate speech in online contents becomes important topic and task. An 
automatic detection method could scan large amount of text, analyze and categorize it as hateful or not. The trends of 
hateful messages could not only be reported to relevant authorities, but it could provide a solid ground to researchers to 
understand how hateful messages in online contents affect the social processes. 

But as noted in (Thomas Davidson, 2017), effective automatic hate speech detection is challenging and very difficult task. 
The difficulties mostly come from the complexity of natural language processing. The ambiguity and language variability 
represents a real challenge to be solved.  On the other hand, when building more complex and effective automatic machine 
learning text classifier, the training data becomes crucial.  

In this paper, we aim to develope a method to detect hate speech in public online contents in Albanian language, while also 
addressing the above mentioned challenges. We have collected data from public Facebook pages in Albanian language, and 

mailto:xh.zenuni@seeu.edu.mk
mailto:j.ajdari@seeu.edu.mk
mailto:f.ismaili@seeu.edu.mk
mailto:b.raufi@seeu.edu.mk


2nd World Conference on Technology, Innovation and Entrepreneurship (WCTIE-2017), V.5,p.368-371    Zenuni, at al. 

_____________________________________________________________________________________________________ 
 DOI: 10.17261/Pressacademia.2017.612                                         369                                                          PressAcademia Procedia 

 
labeled them as hate speech or not. Than a classifier based on SVM (support vector machines) is trained to differentiate 
between these categories. To our best knowldge, the contribution in this paper is two-fold: 

 It represents the first attempt to create a hate speech corpus in Albanian language 

 We make the first attempts to create a hate speech text classifier for Albanian language based on supervised 
machine learning approach 

2. LITERATURE REVIEW  

Bag-of-word approaches like in (Kwok & Wang, 2013) are simpler to implement, especially if the classifier is targeting racial 
hate of speech, but such approaches are insufficieent for accurate classification as it leads to high rates of false positives. 

Syntactic features have been explored in (Gitari, Zuping, Damien, & Long, 2015). The experimental results has shown 
improvements both on precision and recall when used semantic, hate and theme-based features. Chen (Chen, Zhu, Zhou, & 
Xu, 2012) utilize the profanties, obscenities and pejorative terms as features, weighted accordingly and produced a set of 
rules  to model offensive content, which improved the precision on standard machine learning approaches. 

Leveraging morpho-syntactical features, sentiment polarity and word embedding lexicons, Vigna (Vigna, Cimino, 
Dell'Orlleta, Petrocchi, & Tesconi, 2017) proposed two hate speech classifiers for Italian language based on Support Vector 
Machines (SVM) and on Reccurent Neural Network named Long Short Term Memory.  

Other suppervised approaches to hate speech classification have been proposed as well. Neural language models have 
potential (Djuric, Zhou, & Morris, 2015), but in all cases the training set data is important. Moreover, the accuracy of hate 
speech classsifiers could be improved by non-linguistic features, like the gender, ethnicity or age of the author, but this 
information is often unreliabale or unavailable (Waseem & Hovy, 2016). 

3. HATE SPEECH CORPUS 

To our best knowledge, there is no previous work on building a hate speech corpus for Albanian language. Therefore, during 
a period of time, we collected data from Facebook pages in Albanian language and prepared a hate speech corpus that 
could be used by a classifier. This section reports on data collection, annotatin phase, preprocessing and feature selection in 
data. 

3.1. Data Collection 

We explored the Graph API (https://developers.facebook.com/docs/graph-api) provided from Facebook to reterieve and 
build a corpus of comments from two public pages that publish posts on variety of topics on different political and social 
events, and which we suspected to find a lot of comments contating hateful speeches. On the other side, we also looked 
forward posts that contained a significant number of comments. Table 1 summarizes the pages that where crawled and the 
number annotated posts and comments 

Table 1: Dataset Description and Annotations 

 
3.2. Data Annotation 

Two annotators were asked to analyze the content of the crawled comments and to categorize them as hate or no hate. 
Overall, 4886 comments received two annotations, and as hate were considered only the comments that were categorized 
as such from both annotators. In total, 2764 comments were categorized as containing hateful content and on other 
comments either both annotators agreed that the message is not hateful or no concensus was reached. 

3.3 Data Preprocesing 

In order to prepare the data for the supervides learning algorithm, several pre – processing steps were undertaken. First, 
the collected text was transformed to lowercase with the objective to improve syntactic matching. Then, extra white 
spaces, puncuation marks, digits and emoji were removed from the text as they were not considered important in the 
classifiction process. Finally, we removed from the text the words which we find redundant for text classification (such as 
conjuctions) and consequently reduced the size of document-term matrix.   

 
Facebook pages Annotated Post # of Commnets 

jetaoshqef 108 4737 

tvklan 19 149 


2nd World Conference on Technology, Innovation and Entrepreneurship (WCTIE-2017), V.5,p.368-371    Zenuni, at al. 

_____________________________________________________________________________________________________ 
 DOI: 10.17261/Pressacademia.2017.612                                         370                                                          PressAcademia Procedia 

 
4. TEXT CLASSIFICATION MODEL 

We tested the Support Vector Machines (SVM) as supervised learning technique used for text classification. As algorithm it 
captures sparse and discrete features in text classification, which makes it good candidate in our case. On the other side, as 
noted in (Joachims, 1998) there are theoretical evidence that SVM is an extremely strong performer when having high 
dimensional input space, few irrelevant features and especilly when most of text classification problems are linerely 
separable. 

We implemented the approach in R System (http://www.rsystems.com/) based on RTextTool. RtextTool is an easy to use 
tool that can be used for end-to-end implementation by interfacing with existing pre-processing routines and machine 
learning algorithms. The supporting features include the process from document-term matrix creation, data pre – 
porcessing, training, classification, up to analytical reports which help users to understand the classification of the 
employed model. 

The speech corpus was divided in two parts. 4000 records were used as training set, and the rest of 886 records were used 
as testing set. And than based on this dataset, RTextTool functons were used to implement the text classification workflow. 

5. FINDINGS AND DISCUSSIONS 

While there are many techniques to evaluate the performance of the algorithm, precission, recall and F-score are 
considered standard evaluation metrics in classification tasks. Accuracy measures In the context of the hate speech system, 
the accuracy tells what propotion of hate speech comments, are actually hate speech content. Recall tells what percentage 
of hate speech comments did the algorithm correctly classify, and F-score produces a weighted average of precision and 
recall. 

Table 2 reports the results for the conducted experiement. And the numbers were generated through create_analytics( ) 
function contained in RTextTool. 

Table 2: Evaluation of Classification Model 

 
6. CONCLUSION 

This paper presents the first efforts in bulding an automated hate speech classifier for Albanian language texts. The first 
experiments show that binary classificaion based on Suport Vector Machines are a promising approach toward building an 
automated hate speech detection system for online text contents for Albanian language. We are encouraged by initial 
results, however for the hate classfier of Albanian language to achieve results comparable with similar approaches, it needs 
richer hate speech corpus and and to explore other language processing features of Albanian language, which for the time 
being are lacking. However, we believe this work represents the basis toward a building an automated system that could be 
used to track and monitor online content. 

As future work, we intend to extend the annotaed hate speech corpus from different Facebook sites and crawl more 
comments. This will make richer the current training set, which we believe it will consequently increase the evaluation 
metrics employed in standard classification tasks. Another important aspect will be to see how other similar supervised 
learning models will work under the same speech training set. 

 
REFERENCES 

Chen, Y., Zhu, S., Zhou, Y., & Xu, H. (2012). Detecting Offensive Language in Social Media to Protect Adolescent Online Safety. Proceedings 
of the Fourth ASE/IEEE International Conference on Social Computing. Amsterdam. 

Commision, E. (2016). CODE OF CONDUCT ON COUNTERING ILLEGAL HATE SPEECH ONLINE.  

Djuric, N., Zhou, J., & Morris, R. (2015). Hate Speech Detection with Comment Embeddings. Proceedings of the 24th International 
Conference on World Wide Web, (s. 29-30). 

Gitari, N., Zuping, Z., Damien, H., & Long, J. (2015). A Lexicon-based Approach for Hate Speech Detection. International Journal of 
Multimedia and Ubiquitous Engineering, 2015-230. 

(tarih yok). http://www.rsystems.com/.  

https://developers.facebook.com/docs/graph-api. (tarih yok). 

Classifier Precission  Recall F-Score 

SVM .61 0.57 0.58 


2nd World Conference on Technology, Innovation and Entrepreneurship (WCTIE-2017), V.5,p.368-371    Zenuni, at al. 

_____________________________________________________________________________________________________ 
 DOI: 10.17261/Pressacademia.2017.612                                         371                                                          PressAcademia Procedia 

 
Joachims, T. (1998). Text Categorization with Support Vector Machines: Learning with Many Relevant Features. European Conference on 
Machine Learning, (s. 137-142). 

Kottasova, I. (2016). Facebook and Twitter pledge to remove hate speech within 24 hours. 
http://money.cnn.com/2016/05/31/technology/hate-speech-facebook-twitter-eu/. 

Kwok, I., & Wang, Y. (2013). Locate the Hate: Detecting Tweets against Blacks. Proceedings of the Twenty-Seventh AAAI Conference on 
Artificial Intelligence, (s. 1621-1622). 

Thomas Davidson, D. W. (2017). Automated Hate Speech Detection and the Problem of Offensive Language. In the Proceedings of ICWSM 
2017.  

Vigna, D. V., Cimino, A., Dell'Orlleta, F., Petrocchi, M., & Tesconi, M. (2017). Hate Me, Hate Me Not: Hate Speech Detection on Facebook. 
ITASEC. 

Waseem, Z., & Hovy, D. (2016). Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. Proceedings of 
the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (s. 
88-93).