key: cord-0593972-iddcdrdu authors: Ghosh, Satyajit; Ghosh, Aniruddha; Ghosh, Bittaswer; Roy, Abhishek title: Plagiarism Detection in the Bengali Language: A Text Similarity-Based Approach date: 2022-03-25 journal: nan DOI: nan sha: a4311c516544a39e95a386822393f60b8bed8b90 doc_id: 593972 cord_uid: iddcdrdu

Plagiarism means taking another person's work without giving them credit for it. It is one of the most serious problems in academia and among researchers. Multiple tools are available to detect plagiarism in a document, but most of them are domain-specific and designed to work on English texts, although plagiarism is not limited to a single language. Bengali is the most widely spoken language of Bangladesh and the second most spoken language in India, with 300 million native speakers and 37 million second-language speakers. Plagiarism detection requires a large corpus for comparison. Bengali literature has a history of 1300 years, and most Bengali literature books have not yet been properly digitized. As no such corpus existed for our purpose, we collected Bengali literature books from the National Digital Library of India, extracted their text with a comprehensive methodology, and constructed our corpus. Our experiments show an average accuracy between 72.10% and 79.89% in text extraction using OCR. The Levenshtein Distance algorithm is used for determining plagiarism. We have built a web application for end users and successfully tested it for plagiarism detection in Bengali texts. In future, we aim to construct a corpus with more books for more accurate detection.

The Cambridge Advanced Learner's Dictionary defines plagiarism as "The process or practice of using another person's ideas or work and pretending that it is your own" [1].
According to a survey conducted by the Josephson Institute Center for Youth Ethics, 59% of high school students admitted to cheating, and one in three students uses the internet to plagiarize assignments. Online learning and tests have seen exponential growth in the context of the COVID-19 pandemic. Teachers should check the quality of assignments properly to minimize plagiarism at an early stage [2]. Checking every paper for plagiarism is a time-consuming and tedious job. To help teachers, evaluators, and researchers, multiple plagiarism detection tools are available in the market. These tools are mostly domain-specific and designed for detecting plagiarism in English texts. They are either incapable of or inefficient at plagiarism detection in other languages, especially Indian regional languages. Bengali is one such language. It has developed over roughly 1300 years, and the timeline of Bengali literature is divided into three phases: ancient, medieval, and modern [3]. Thus, plagiarism detection in Bengali requires a corpus consisting of many old and new literary works. We built a corpus from a small subset of Bengali literature and then performed plagiarism detection with it to analyze its performance and accuracy.

The rest of the paper is organized as follows: Section 2 presents the proposed methodology and implementation for corpus creation. In Section 3 we discuss plagiarism detection using our proposed algorithm and its implementation using different tools and technologies. Section 4 reports the experimental results and performance of our detection tool. Finally, Section 5 concludes this paper.

A corpus is a collection of many processed, ordered, and selected texts. A text corpus is used for data mining [4], natural language processing [5], emotion analysis [6], plagiarism detection [7], and many other tasks.
Our goal is to make a corpus that will be helpful for plagiarism detection and can be used for other work in future. The proposed methodology is a step-by-step process for generating a corpus of Bengali literature. Below is the pictorial representation of the proposed methodology.

Step 1 is the collection of Bengali literature books. At the time of writing, the National Digital Library of India had more than 400 books on this topic from various sources. We collected 200 books, both old and new, for the generation of the corpus. Most of the books in the library are scanned PDF copies of physical books. In Step 2 we used Optical Character Recognition (OCR) to extract the Bengali text from the PDF files. Step 3 is text cleanup: the extracted text was found to contain invalid characters, non-Bengali characters, and whitespace, all of which we removed. In Step 4 we stored the cleaned text in the database.

Generating the corpus requires several tools and programs, described below. As a first step, we collected Bengali literature books from the National Digital Library of India and stored them in a directory after renaming them with unique identifiers, recording their other details in an Excel file.

Tesseract is an open-source OCR engine that can be trained to recognize any language; according to its official documentation, it supports more than 100 languages out of the box. Bengali is one of the languages supported out of the box through pre-trained models. Tesseract takes images as input for OCR, so we converted the PDF files into images using a small Python program and stored the images on the hard disk in their respective folders. We then provided the images as input to the OCR engine using OpenCV.
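The PDF-to-image conversion and OCR steps described above can be sketched roughly as follows. This is a minimal sketch and not the authors' program. It assumes the third-party packages `pdf2image`, `opencv-python` and `pytesseract` are installed, together with Tesseract's Bengali language data (`ben`). The helper names, the one-folder-per-book layout and the 300 dpi setting are hypothetical choices made for illustration.

```python
from pathlib import Path


def page_image_path(image_root, book_id, page_no):
    """Hypothetical layout: one folder per book, one PNG per page."""
    return Path(image_root) / book_id / f"page_{page_no:04d}.png"


def pdf_to_images(pdf_path, image_root, book_id, dpi=300):
    """Render every page of a scanned PDF to a PNG file on disk."""
    # Third-party import kept local so this module loads without it.
    from pdf2image import convert_from_path

    saved = []
    pages = convert_from_path(str(pdf_path), dpi=dpi)  # list of PIL images
    for page_no, page in enumerate(pages, start=1):
        target = page_image_path(image_root, book_id, page_no)
        target.parent.mkdir(parents=True, exist_ok=True)
        page.save(target, "PNG")
        saved.append(target)
    return saved


def ocr_image(image_path):
    """Read one page image with OpenCV and run Bengali OCR on it."""
    import cv2
    import pytesseract

    image = cv2.imread(str(image_path))
    if image is None:  # cv2.imread returns None for unreadable files
        raise FileNotFoundError(image_path)
    # OpenCV loads images as BGR; convert to RGB before handing to Tesseract.
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    return pytesseract.image_to_string(rgb, lang="ben")
```

Keeping one image per page on disk mirrors the page-level granularity that the detection step later reports to the user.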
Other image processing steps such as rescaling, binarization, and noise removal are done by Tesseract internally, so we do not have to perform them.

Inspection of the text extracted by Tesseract showed that it contains whitespace and invalid characters. Bengali characters occupy the range U+0980 to U+09FF in Unicode, so we used a regular expression to remove any characters outside this range, along with the whitespace.

To store the corpus, we used an SQLite3 database. SQLite is a lightweight disk-based database that does not require a separate server process. After the clean-up process, the extracted text is stored in the database. Next, a small Python program uses the Pandas library to copy the other details of each book from the Excel file to the SQLite database.

Accuracy is the quality of being correct and precise. Determining the accuracy of the OCR operation is very important for a good-quality corpus. We found that our collection can be grouped into two categories. The first category consists of old books in which the images of scanned pages have noise, black marks on the border, and unclear text. We call this the "Dirty" category. The second category consists of recent books in which the images of scanned pages are mostly clear and without noise. We call this the "Clean" category.

A figure for the "Dirty" category shows the text extracted from a page where the text is not clear and the scan has noise and an uneven border. We selected random pages from this category and found an average accuracy of 72.10%. A figure for the "Clean" category shows the text extracted from a page where the text is clear and has no noise. We selected random pages from this category and found an average accuracy of 79.89%. To determine the accuracy, we used the Levenshtein Distance algorithm.
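The clean-up and storage steps described earlier in this section can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code. The table name `pages`, its columns and the function names are hypothetical. The sketch also assumes that each run of out-of-range characters is replaced by a single space rather than deleted outright, so that word boundaries survive for later tokenization.

```python
import re
import sqlite3

# Any run of characters outside the Bengali Unicode block (U+0980..U+09FF).
# Whitespace is outside the block too, so mixed runs collapse together.
NON_BENGALI = re.compile(r"[^\u0980-\u09FF]+")


def clean_text(raw):
    """Keep Bengali characters only, separated by single spaces (assumption)."""
    return NON_BENGALI.sub(" ", raw).strip()


def store_page(conn, book_id, page_no, raw_text):
    """Clean one OCR'd page and store it in a hypothetical `pages` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "book_id TEXT, page_no INTEGER, content TEXT, "
        "PRIMARY KEY (book_id, page_no))"
    )
    conn.execute(
        "INSERT OR REPLACE INTO pages (book_id, page_no, content) VALUES (?, ?, ?)",
        (book_id, page_no, clean_text(raw_text)),
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # a file path would be used in practice
    store_page(conn, "B001", 1, "  বাংলা abc123 \n ভাষা!! ")
    print(conn.execute("SELECT content FROM pages").fetchone()[0])
```

Note that the danda (।, U+0964) lies outside the Bengali block, so a filter on this range also drops sentence-ending punctuation.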
We provided the manually entered actual text of the page and the text extracted by the OCR engine to the algorithm, which measures the similarity between the two texts. This algorithm is discussed in detail in Section 3.1 of this paper.

If one person presents another person's work as their own original work, it is considered plagiarism. Plagiarism is generally not a crime, but in academia and industry it is an ethical offence [8]. The exponential growth in digital resources increases the possibility of plagiarism. Plagiarism can appear in multiple forms, such as claiming another person's work as one's own or using another person's work without giving credit [9]. Methods for detecting plagiarism in a text include citation-based, semantic-based, cross-language-based, structural-based, and character-based methods [10]. Our proposed algorithm works on the basis of text similarity between two documents.

The Levenshtein distance (also known as edit distance) measures the minimum number of edits required to change one string into another using only three operations: insertion, removal, or replacement of a character. It can be computed with a recursive solution or with dynamic programming. Implemented with dynamic programming, the algorithm has time and space complexity O(m × n) for strings of lengths m and n [11], [12]. The higher the Levenshtein distance, the lower the similarity of the two strings. Suppose two strings S1 and S2 have lengths M and N respectively, and their Levenshtein distance is denoted by DIFF; then the similarity is calculated as

\[ \mathrm{Similarity}(S_1, S_2) = 1 - \frac{\mathrm{DIFF}}{\max(M, N)} \]

Our plagiarism checker application takes the user input, tokenizes it, and removes the stopwords. The same steps are followed for the texts in the database. This reduces the time taken by the algorithm and improves its accuracy.
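The distance, the similarity formula and the stopword step described above can be sketched as follows. This is a minimal sketch, not the authors' implementation. The function names are hypothetical, tokenization is assumed to be a plain whitespace split, and the stopword set passed in the example is a tiny illustrative placeholder rather than the list used in the paper. The distance is computed over Unicode code points with the full dynamic-programming table, which matches the stated O(m × n) time and space.

```python
def levenshtein(s1, s2):
    """Edit distance via the full dynamic-programming table."""
    m, n = len(s1), len(s2)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # remove all i characters of s1
    for j in range(n + 1):
        dist[0][j] = j  # insert all j characters of s2
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # removal
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # replacement (or match)
            )
    return dist[m][n]


def similarity(s1, s2):
    """Similarity(S1, S2) = 1 - DIFF / max(M, N); two empty strings count as identical."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(s1, s2) / longest


def remove_stopwords(text, stopwords):
    """Whitespace tokenization (assumption) followed by stopword filtering."""
    return " ".join(token for token in text.split() if token not in stopwords)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))          # 3
    print(remove_stopwords("আমি এবং তুমি", {"এবং"}))  # illustrative stopword set
```

As a worked example, "kitten" and "sitting" differ by three edits, so DIFF = 3, max(M, N) = 7, and the similarity is 1 − 3/7 ≈ 0.571.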
After that, our algorithm compares the user input with every page of every book in the database and stores the similarity scores. After completing the comparisons, it fetches information about the pages whose similarity scores are highest and meet the threshold for being considered plagiarism. Finally, it displays the result to the user along with the book title, author name, page number, and similarity score.

To implement the Bengali Plagiarism Checker, we used Flask, a web framework for building web applications with a Python backend. The Flask application communicates with the SQLite database prepared during corpus generation and serves the end user.

Here, we provided texts as input and calculated the similarity score using our web application. We observed that our application can successfully find the correct book title, author name, page number, and similarity score. From our observations, plagiarism detection works properly in both categories except on a few "Dirty" books, and a similarity score of 20 and above can confirm plagiarism. The accuracy of plagiarism detection increases with the amount of text given as input: the more text, the better the detection.

Plagiarism is one of the most serious problems faced by researchers and evaluators. E-learning and blended modes of teaching increase the probability of plagiarism many times over. This paper presented a comprehensive methodology for determining plagiarism in Bengali texts, coupled with the corpus generation process for it. The same methodology can be applied to other regional languages for plagiarism detection. In this research, the Levenshtein Distance algorithm is successfully implemented in a web application that serves end-user requests to determine plagiarism in Bengali literature. All the source code is available on our GitHub repository. The volume of Bengali literature is huge.
Collecting a greater number of books will increase the capability of the detection tool. In future, a web scraper can be built to collect more books. Although the proposed algorithm works well, its time complexity is high, and it takes over a minute to provide results. More efficient use of data structures and a modified version of the proposed algorithm may reduce the detection time.

[1] Cambridge advanced learner's dictionary
[2] Plagiarism in e-learning systems: Identifying and solving the problem for practical assignments
[3] History of Bengali literature
[4] A review of text corpus-based tourism big data mining
[5] Web text corpus for natural language processing
[6] Bhaav-a text corpus for emotion analysis from hindi stories
[7] Development of Marathi Text Corpus for Plagiarism Detection in Marathi Language
[8] Plagiarism, norms, and the limits of theft law: Some observations on the use of criminal sanctions in enforcing intellectual property rights
[9] Plagiarism: Taxonomy, Tools and Detection Techniques
[10] Survey of text plagiarism detection
[11] The algorithm design manual
[12] A guided tour to approximate string matching
[13] Levenshtein Distance, in Three Flavors