key: cord-0457087-qydc4446
authors: Al-Nabki, Mhd Wesam; Fidalgo, Eduardo; Alegre, Enrique; Alaiz-Rodríguez, Rocío
title: Short Text Classification Approach to Identify Child Sexual Exploitation Material
date: 2020-10-29
sha: b1d78848ec2a6c25276da3def92ee37fe4972db5
doc_id: 457087
cord_uid: qydc4446

Producing or sharing Child Sexual Exploitation Material (CSEM) is a serious crime fought vigorously by Law Enforcement Agencies (LEAs). When an LEA seizes a computer from a potential producer or consumer of CSEM, it needs to analyze the files on the suspect's hard disk looking for evidence. However, manually inspecting the file content for CSEM is a time-consuming task. In most cases, it is infeasible within the time available to the Spanish police under a search warrant. Instead of analyzing the content, another approach that can speed up the process is to identify CSEM by analyzing the file names and their absolute paths. The main challenge of this task lies in dealing with short text that the owners of this material distort deliberately, using obfuscated words and user-defined naming patterns. This paper presents and compares two approaches based on short text classification to identify CSEM files. The first one employs two independent supervised classifiers, one for the file name and the other for the path, and their outputs are later fused into a single score. Conversely, the second approach uses only the file name classifier to iterate over the file's absolute path. Both approaches operate at the character n-gram level, while binary and orthographic features enrich the file name representation, and a binary Logistic Regression model is used for classification. The presented file classifier achieved an average class recall of 0.98.
This solution could be integrated into forensic tools and services to help Law Enforcement Agencies identify CSEM without analyzing every file's visual content, which is computationally far more demanding.

In 2017, the Council of the European Union (EU) prioritized cybercrimes related to Child Sexual Abuse (CSA), considering them the most serious crimes for the period between 2018 and 2021 [1]. According to the European Police Office, Child Sexual Exploitation Material (CSEM) is defined as the sexual abuse of a person under 18 years old, the production of images or videos of the abuse, and the distribution of such content online [2]. Darknets, such as The Onion Router (Tor) [3, 4, 5] and FreeNet [6], and Peer to Peer (P2P) networks, like eDonkey [7, 8], are environments where the interchange of CSEM seems to proliferate, thanks to the high level of privacy and anonymity they provide to their users. These characteristics allow pedophiles to share CSEM easily, away from the monitoring of Law Enforcement Agencies (LEAs). It is worth mentioning that during the COVID-19 outbreak, Interpol reported a significant increase in the exchange of CSA material on P2P and Darknet networks as well as in online gaming and messaging applications [9]. CSEM producers and consumers might save this content on their local computers, at least temporarily. When an LEA inspects a home to analyze a suspect's computers, a police agent reviews the files on the investigated hard drive, trying to determine whether or not the suspect has stored CSEM on the computer [10]. This process needs to be accomplished in a limited time and as accurately as possible [11]. This work aims to build a File Classifier (FC) that decides whether a given file is related to CSEM or not according to its name and absolute path. The FC will not analyze the file content, which is handled by other modules that are out of this paper's scope. Hence, the FC will act as a preliminary filter in a CSEM detection pipeline.
Building an automatic FC is a challenging task for several reasons. Firstly, a binary supervised algorithm requires training samples of Non-CSEM and CSEM files. However, there are no publicly available datasets of the latter class, and crawling samples from a P2P network or the Darknet is illegal [12]. Therefore, only CSEM file names obtained legally, i.e., provided by LEAs, could be used. Secondly, a file name is typically a short text, which leads to a sparse representation of the samples: the feature space is massive, while each instance is represented by only a few features. Finally, CSEM producers or consumers tend to invent a personalized file naming style, with their own vocabulary, abbreviations, and acronyms, to circumvent detection tools. For example, in a sample named "!!!!yoB0yXX", the exclamation marks refer to the age of a boy, and the letter O is replaced by the number zero. Hence, most likely, this file is related to the abuse of a four-year-old boy. It is worth mentioning that the last two challenges, i.e., the lack of context and the deliberate distortion of the text, make it more difficult to build or use pre-trained language models, such as word2vec [13] and GloVe [14], and contextualized word representations, like Bidirectional Encoder Representations from Transformers (BERT) [15] and Embeddings from Language Models (ELMo) [16]. Nevertheless, high accuracy may not be achievable, as some of these resources carry minimal information. Even so, such a classifier would play a key role in filtering the files with a higher probability of being CSEM and in facilitating the analysts' work, which would otherwise be infeasible.

A small body of research has focused on the problem of identifying CSEM via their file names. The most recent works are the research of Pereira et al. [17] and Al-Nabki et al.
[18], who experimented with different deep learning and machine learning algorithms to build a supervised classifier for the files. Unlike the common strategies that use file names only, this paper goes further and incorporates their absolute paths in parallel. The approach of using both pieces of information, i.e., the name and the path, was first presented by Pereira et al. [17], who trained a single classifier on the absolute file path, including the file name. In contrast, the File Classifier (FC) we propose in this paper uses a dedicated classifier for each component: a File Path Classifier (FPC) for the absolute paths and a File Name Classifier (FNC) for the file names. Their outputs are fused into a single score. This design prevents the classification decision from being skewed toward the content of the absolute file path, which typically occupies most of the text. We built on our previous work for the FNC design [18], extending the file name representation and using a bigger dataset. Furthermore, this paper elaborates on the FPC and demonstrates how its output is integrated with that of the FNC. The file name and the file path can complement each other when either of them carries a CSEM pattern. Nevertheless, this approach will not be advantageous when neither source carries such a pattern, such as a file named only with numbers and located in the root folder.

We propose two approaches to build the FC (see Fig. 1). The first one uses two standalone classification models, one for the FNC and another for the FPC. The outputs of these two classifiers are fused into a single output. The other approach employs the FNC only to classify both the file name and the path. It iterates over the absolute path along with the file name, and whenever the FNC detects a CSEM name within the path, it reports the file as CSEM. The main contributions of this paper are summarized as follows.
• We propose a framework for classifying files into CSEM or safe material based on the fusion of the outputs of two supervised classifiers, which use file names and their absolute paths.
• We extend the text of the file names by appending two additional intermediate representations suited for the task of CSEM detection. The first one is a novel binary representation, which distinguishes character blocks from non-character ones. The second is an orthographic feature that captures the variation in the types of file name characters. To the best of our knowledge, the orthographic feature has not been used before to encode file names for text classification, only for named entity recognition tasks [19, 20].
• We build a dataset with 5.9M and 890K unique file path and file name samples, respectively. To the best of our knowledge, this is the largest dataset used for classifying CSEM using file names and paths.
• We apply our framework to a real-case application: CSEM detection. We also introduce our framework into a practical forensic tool that could support LEAs worldwide in the task of CSEM detection.

The rest of the paper is organized as follows: Section 2 presents the related work. Section 3 describes the proposed classification methodology. Next, Section 4 explains how the datasets of the FNC and the FPC are created and what their main features are. Then, in Section 5, we describe the experimentation performed and discuss the results obtained. Finally, Section 6 presents the conclusions and points to our future research.

The use of file name classification in recognizing CSEM has not received much attention despite its efficacy in identifying potential forensic evidence. To the best of our knowledge, only a few research papers have been published in recent years. To begin with, Panchenko et al. [7] attempted to normalize file names using Short Message Service (SMS) normalization techniques proposed by Beaufort et al. [21].
With the normalized text, they trained a Support Vector Machine (SVM) classifier and obtained an accuracy of 96.97% on their dataset. Peersman et al. [22] proposed a framework called iCOP to detect CSEM on P2P networks. The first stage of their classification pipeline was a manually constructed dictionary-based filter that held CSEM keywords. They used character n-grams of size two to four to capture more features of the file name and a binary SVM as the classifier. Afterward, in their more recent work [8], Peersman et al. used a similar representation but benchmarked more classifiers, like SVM and Naive Bayes (NB). Due to the lack of a public dataset for this task, they also evaluated their proposal on a custom dataset, and they observed that the SVM classifier could identify CSEM file names with a recall rate of 0.43. Al-Nabki et al. [18] compared machine learning classifiers, such as SVM and Logistic Regression (LR), that use character n-grams with Term Frequency-Inverse Document Frequency (TF-IDF), against deep learning classifiers based on Convolutional Neural Networks (CNN). Specifically, they adopted two CNN models developed by Zhang et al. [23] and Kim et al. [24]. The model of Zhang et al. was the best benchmarked CNN-based classifier and obtained an F1 score of 0.85, while the machine learning approach using the LR classifier scored a slightly lower F1 score of 0.84. The major difference was their processing time on a CPU machine, where the latter was by far quicker than the former. Pereira et al. [17] compared several machine learning and deep learning models to classify files using the file names and paths. They conducted the experiments on a dataset of 1,010,000 file paths from 55,312 unique storage systems provided by Project VIC International. Similar to the conclusion of Al-Nabki et al. [18], they found that the CNN character-based model proposed by Zhang et al.
[23] achieves the best recall rate of 0.94.

The problem of CSEM identification through file names can be framed within a wider research topic, short text classification [25, 26, 27, 28, 29, 30], and in particular, news headlines classification and Twitter posts classification. The news headlines classification task attempts to group news articles based on their titles, where a title is typically made up of a few words. Rana et al. [28] proposed a pipeline of three stages: data pre-processing, text representation, and classification. In the data pre-processing step, the text was tokenized into words, special characters were replaced with spaces, stop words were removed, and the text was stemmed. For the text representation, the authors used TF-IDF, Information Gain (IG) [27], and Boolean Weight (BW) [31]. Finally, in the classification stage, Rana et al. explored NB [32], SVM [33], K-Nearest Neighbor (KNN) [34], and Decision Trees (DT) [35]. However, the core difference between our problem and news headlines classification is that the latter has high-quality input text, where the punctuation marks are used correctly and there are no misspelled words.

Classifying tweets also falls under the umbrella of short text classification, as the common length of a tweet is 33 characters, while the maximum number of characters is 280 [36]. Furthermore, the quality of the text can be low in comparison to the news headlines problem, and it might contain abbreviations to save space or misspelled words [36]. Imran et al. [37] pre-processed the tweets by removing hyperlinks, mentions, and stop words. Then, they used n-grams and IG techniques for feature extraction and a Random Forest (RF) classifier [38]. Chen et al. [39] proposed a framework to identify cyberbullying on Twitter.
For text representation, they compared pre-trained language models, like Word2Vec [13] and GloVe [14], with traditional text encoding techniques, such as TF-IDF, and they observed a decline in performance when embedding-based representations were used. For classification, they compared traditional machine learning classifiers, like LR and SVM, with deep learning classifiers, like Long Short-Term Memory (LSTM) [40] and Convolutional Neural Network (CNN) [23]. Carnevale et al. [41] proposed an algorithm to classify noisy and low-quality text generated from critical patients' posts on Twitter. The authors employed n-grams with TF-IDF for feature extraction and benchmarked two classifiers, SVM and NB, to compare their performance on the task.

Furthermore, the problem of file path classification can be treated as a branch of URL classification, since both share similar characteristics in terms of structure and the use of concatenated words. This topic has been investigated widely by many researchers [42, 43, 44, 45, 46, 47]. Sahingoz et al. [46] used a URL classification approach to identify phishing websites through their addresses. They explored various features extracted manually from the URL and used them to benchmark several machine learning classifiers, such as DT, SVM, and RF. Trevisan et al. [45] examined the use of Generative Adversarial Neural Networks (GANs) to classify four classes of URLs, given their ability to cope with the lack of training samples.

This section introduces two approaches for designing the File Classifier (FC) in order to identify CSEM. The first one involves two standalone classifiers, one for the file name and another for the file path, and the outputs of these two classifiers are fused into a single value that represents the prediction confidence. The second approach is to build a single classifier for the file names.
Since the path consists of a sequence of names, a file name classifier can iterate over the sub-directory names from the root directory to the file name. Finally, the prediction confidences of the sub-directories are fused. Both approaches have a typical machine learning design that consists of three main stages [5]: a text pre-processor, a feature extractor, and a classifier. In the following, we elaborate on each approach in detail.

We present two classifiers, a File Name Classifier (FNC) and a File Path Classifier (FPC). Each classifier has its own dataset for training and testing.

The FNC presented in this paper attempts to enhance the previous implementations explored in the literature by: 1) enhancing the file name representation and 2) training on a larger and more representative dataset (Section 4.1).

File Name Feature Extraction. Finding an adequate representation of the text is a crucial step in a classification pipeline. For this work, we used character n-grams to extract all the patterns of two to five consecutive characters of an input file name, which builds a set of tokens. Then, we apply the well-known TF-IDF technique [48], since it gives higher weights to grams that are frequent in a few file names and, at the same time, decreases the weight of grams that occur frequently in many files. This way, it overcomes the issue of misspelled words or personalized writing styles. Table 2 shows an example of the two to five grams of the file name "!!!!yoB0yXX". Furthermore, to discard noisy tokens, we set thresholds for the minimum and the maximum term frequency.

File Name Classification. Once the features are extracted, we use them to train the FNC. Based on previous research [18] and considering both classification performance and execution time, we use Logistic Regression.

The FPC is a supervised binary classifier that decides whether a given file's absolute path is CSEM-related or not.
The FPC consists of the following three components:

File Path Pre-processing. Only the file paths are pre-processed at this stage, since the FNC already handles the file names. Initially, the path is converted into a string by replacing the slash sign (/) with a space. Next, we replace special characters and digits with # and $, respectively. Finally, using the regular expression library, we split the text at capital letters, if any. Table 3 illustrates the pre-processing procedure applied to two samples.

File Path Feature Extraction. The problem of path classification is similar to file name classification. In both cases, we could not use pre-trained models because most of the text would be out of vocabulary and would not be represented properly. For this reason, we used the same feature extraction technique as for the FNC, i.e., character-level n-grams of two to five characters. We applied it along with the TF-IDF algorithm, as described in Section 3.1.1.

File Path Classification. Once the features are extracted from the file paths, we use them to train a binary supervised Logistic Regression classifier, which separates CSEM paths from regular ones.

This section presents how we aggregate the predictions of the two classifiers, the FNC and the FPC, into one prediction value. The desired fusion strategy must be sensitive to potential CSEM in either the file name or the file path. Hence, our fusion strategy returns the result of the classifier that has the highest CSEM confidence. For example, for a given sample x, the FNC predicts it is CSEM with 20% confidence and Non-CSEM with 80%, while the FPC predicts it is CSEM with 40% and Non-CSEM with 60%. In this case, the FPC's confidence in the CSEM class is higher than the FNC's, and therefore the result of the FPC will be the final output of the fusion. Formally, Eq. 1 expresses this procedure.
$$FC(x) = \begin{cases} FNC(x), & \text{if } FNC(x)_{CSEM} \geq FPC(x)_{CSEM} \\ FPC(x), & \text{otherwise} \end{cases} \quad (1)$$

where $FC(x)$ refers to the classification result of a sample $x$, and $FNC(x)_{CSEM}$ and $FPC(x)_{CSEM}$ refer to the confidence of each classifier in the CSEM class.

Typically, the absolute path of a file is made up of a sequence of folder names. This approach treats each folder as a standalone file name and uses the previously implemented FNC (presented in Section 3.1.1) to classify it. Therefore, if an input path has N sub-directories, including the file name, the FNC will be called N times and classify N entries. If any of these N entries is reported as CSEM, the complete path is considered CSEM. Otherwise, the path is considered Non-CSEM. Unlike a majority voting approach, this technique is highly sensitive to any suspicious sub-directory name in the input path. The prediction complexity of this approach is proportional to the depth of the absolute path. Hence, for M samples, each with N sub-directories, the complexity would be $O(N \times M)$.

Another motivation for separating file names from file paths is that a unique path could contain hundreds of files, resulting in hundreds of file name samples. Table 4 shows five file names that refer to two unique paths. Furthermore, we noticed that paths are characterized by a lack of explicit CSEM-related words. Instead, considering the words of all the sub-directories of a path at once may reveal its suspicious content. To illustrate this, Table 4 gives two unique path samples. In the first example, the word "Sarah" on its own or "Silver Starlets" are not CSEM-related, but their co-occurrence with other directories named "Starlets", "skirt", and the number five (the last directory of the first example) could be an indicator of a sequence of photos of a five-year-old girl wearing a pink skirt.

For the negative class, i.e., the safe files, we used a dataset published by the National Software Reference Library (NSRL) that contains more than 32 million file names.
We selected an initial subset of 800,000 Non-CSEM examples, resulting in 537,807 after applying the pre-processing step. Regarding the CSEM class, we collected the examples thanks to the collaboration between the Spanish National Cybersecurity Institute (INCIBE) and the Spanish LEAs. The latter provided us with a list of dumps of hard disks seized from criminals' computers. The list had 90,000 CSEM samples. However, after pre-processing them, the number decreased to 37,648 unique instances.

Similar to the file name classifier, the file path classifier has two classes, CSEM and Non-CSEM. For the Non-CSEM class, we gathered 3,031,802 unique paths from dumps of eight computers that host Non-CSEM files, and for the CSEM class, 2,864,105 paths provided to us by the Spanish LEA. After pre-processing these paths, we ended up with 2,065,590 unique instances, distributed as 924,445 and 1,141,145 for the CSEM and the Non-CSEM classes, respectively.

The experiments were carried out on a PC with an Intel(R) Core(TM) i7 processor and 32 GB of RAM under Windows 10. We used Python 3 with Scikit-Learn for implementing the classifiers. Regarding the File Name Classifier's configuration, we used character n-grams, extracting patterns from two to five grams [18]. Also, we set the thresholds for the minimum and the maximum gram proportion to 0.0005 and 0.999, respectively. For the LR classifier, we empirically set the parameter C, which refers to the inverse of the regularization strength, to 100, and we activated the class weight parameter to account for the class imbalance during training. The rest of the parameters were left at the default values set by the Scikit-Learn library. The File Path Classifier used the same configuration as the File Name Classifier. To estimate the models' performance, we report the performance of each classifier on a test set.
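The character n-gram configuration described above can be illustrated with a minimal standard-library sketch. It assumes that the two thresholds are document-frequency proportions (the share of file names that contain a gram), with 0.0005 as the lower bound and 0.999 as the upper bound; the function names and the sample file names are hypothetical, and the code is an illustration rather than the authors' implementation.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=5):
    """Return every run of n_min..n_max consecutive characters in text."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def filter_vocabulary(names, min_df=0.0005, max_df=0.999):
    """Keep grams whose document-frequency proportion lies in [min_df, max_df].

    Assumption: the thresholds are proportions of file names containing a gram.
    """
    doc_freq = Counter()
    for name in names:
        doc_freq.update(set(char_ngrams(name)))  # count each gram once per name
    total = len(names)
    return {gram for gram, count in doc_freq.items()
            if min_df <= count / total <= max_df}


# Hypothetical, benign file names used only to exercise the functions.
names = ["report_2019.pdf", "report_2020.pdf", "notes.pdf"]
vocab = filter_vocabulary(names)
```

In this toy run, a gram such as ".pdf" occurs in every name and is discarded by the upper threshold, while grams that occur in only some names are kept for the TF-IDF weighting step.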
For the File Name Classifier, the dataset has 890,000 samples before pre-processing, and we split it 80/20 into the training and the testing sets, respectively. A detailed description of the dataset size is given in Table 5. Unlike the file name samples, we could not split the file path samples into training and testing sets by a fixed percentage, because the samples of these sets must be disjoint. Hence, the training and the testing paths come from dumps of distinct machines. Table 6 gives detailed information about the dataset class sizes. Finally, to test the performance of both the file name and the file path models together, we created a binary dataset of 50,000 samples, equally distributed between the classes. We randomly sampled 50,000 file paths and 50,000 file names from the test sets of the file paths and the file names, respectively. Then, we created a balanced synthesized test set by fusing these two sets. A sample is considered CSEM if its name or path was sampled from a CSEM instance; otherwise, it is tagged as Non-CSEM.

The principal objective of this work is to assist LEAs in detecting CSEM through their file names, avoiding the exposure of an agent to CSEM. Therefore, a low number of false negatives (a file named with CSEM content identified as Non-CSEM) is more desirable than a low number of false positives, i.e., Non-CSEM file names wrongly categorized as CSEM. Hence, it is more important to obtain a high recall for the CSEM class than for the Non-CSEM class. The recall of a class is calculated as the number of samples correctly classified for that class (the True Positives, TP) over the total number of samples of that class (the True Positives, TP, and the False Negatives, FN). Equation (2) shows how recall is estimated for a given class.

$$\text{Recall} = \frac{TP}{TP + FN} \quad (2)$$

Nevertheless, the precision of a classifier is also a crucial factor in measuring its performance, as it shows the proportion of correctly identified samples.
Class precision is calculated as the ratio of correctly classified file names of that class (the True Positives, TP) to the total number of samples predicted as that class (the True Positives, TP, and the False Positives, FP), and it is given in Equation (3).

$$\text{Precision} = \frac{TP}{TP + FP} \quad (3)$$

Finally, the F1 score of a class summarizes the two aforementioned metrics, as it is the harmonic mean of the precision and the recall, and it is calculated following Equation (4).

$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \quad (4)$$

Additionally, it has been shown that the accuracy metric is not reliable when the dataset is imbalanced [49], as in our case, where the majority of the samples are Non-CSEM file names. An alternative is to use the average class recall rather than the overall dataset-level accuracy.

In this section, we evaluate both classifiers and the proposed fusion methods, as described in Section 3. Table 7 analyzes the impact of the features used to boost the file name representation. Our results show that when all the representation techniques are combined, we obtain the best classification performance for the FNC, with an average class recall of 0.98 and an F1 score of 0.96. Afterward, we evaluated the FPC on its test set, as shown in Table 8. The FPC obtained 0.97 for both the average class recall and the F1 score, i.e., a slightly lower average class recall but a slightly higher F1 score than the FNC, which scored 0.98 and 0.96, respectively. In addition to reporting the performance of each classifier individually, we analyze the two techniques for fusing them, as described earlier. Table 9 shows that using two standalone classifiers, one for the file path and one for the file name, surpasses the single iterative classifier approach. The two-classifier architecture achieved an average class recall of 0.98, which is higher than that of the other approach, which iteratively uses the FNC and scores 0.74.
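The class-level metrics used in this evaluation can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the evaluation code used for the reported results; the function names and the toy labels "pos" and "neg" are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here.

```python
def class_metrics(y_true, y_pred, positive):
    """Return (recall, precision, f1) for the class named by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1


def average_class_recall(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    classes = sorted(set(y_true))
    recalls = [class_metrics(y_true, y_pred, c)[0] for c in classes]
    return sum(recalls) / len(recalls)


# Toy imbalanced example with two "pos" and four "neg" samples.
y_true = ["pos", "pos", "neg", "neg", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg", "neg", "pos"]
pos_metrics = class_metrics(y_true, y_pred, "pos")
acr = average_class_recall(y_true, y_pred)
```

Because each class contributes equally to the mean regardless of its size, the average class recall is less affected by the majority class than the overall accuracy.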
In this paper, we presented a supervised machine learning approach to distinguish files that may contain Child Sexual Exploitation Material (CSEM) from regular files (Non-CSEM). Given that this work aims to build a fast CSEM predictor, only file names and paths are used. We proposed two solutions: 1) building two standalone classifiers, a File Name Classifier (FNC) and a File Path Classifier (FPC), and then fusing their outputs into a single decision, and 2) dividing the file path into a list of folder names and using the FNC to classify each name in the path. Our results support the superiority of the former approach, as it obtained an average class recall of 0.98, while the latter scored an average class recall of 0.74. For the FNC, we pre-processed the text and boosted it with two features, binary and orthographic, which increased the recall rate of the CSEM class from 0.89 to 0.93 and yielded an average class recall of 0.98. The FPC uses an architecture similar to that of the FNC but differs in the pre-processing stage, and it achieved an average class recall of 0.97. The empirical evaluation was conducted on a dataset extracted from the file names and file paths. As future work, we plan to enlarge the dataset by obtaining samples from various seized computers, exposing the model to a wider range of CSEM file name patterns. Furthermore, once the dataset is extended, we aim to build a character-based language model [50] for CSEM files. The assessment of transformer-based models, such as BERT [15] and XLNet [51], for text classification is part of our immediate future research, as they have shown promising results on various NLP tasks.
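The two solutions summarised above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the classifiers are represented only by their CSEM-class confidences or by a stand-in predicate, ties in the fusion are resolved in favour of the FNC (an assumption), and all function names and the example path are hypothetical.

```python
def fuse_max_confidence(fnc_csem, fpc_csem):
    """First approach: return the classifier with the higher CSEM confidence.

    Returns a (source, confidence) pair. Ties go to the FNC (assumption).
    """
    if fnc_csem >= fpc_csem:
        return "FNC", fnc_csem
    return "FPC", fpc_csem


def iterate_path(path, name_is_flagged):
    """Second approach: flag a path if any of its components is flagged.

    `name_is_flagged` stands in for a trained file name classifier and is
    called once per component, so the cost grows with the path depth.
    """
    parts = [part for part in path.split("/") if part]
    return any(name_is_flagged(part) for part in parts)


def stub_classifier(part):
    """Hypothetical stand-in predicate used only for this illustration."""
    return "flagged" in part


# Confidences from the fusion example in the text: FNC 20%, FPC 40%.
fused = fuse_max_confidence(0.2, 0.4)
hit = iterate_path("/home/user/flagged_dir/a.txt", stub_classifier)
miss = iterate_path("/home/user/docs/a.txt", stub_classifier)
```

With the confidences from the fusion example, the FPC result is the one returned, and the iterative variant flags a path as soon as a single component is flagged.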
Torank: Identifying the most influential suspicious domains in the tor network
Classification of illegal activities on the dark web
Classifying illegal activities on tor network based on web textual contents
Statistical detection of downloaders in freenet
Detection of child sexual abuse media on p2p networks: Normalization and classification of associated filenames
icop: Live forensics to reveal previously unknown criminal media on p2p networks, Digital Investigation
Pornography and child sexual abuse detection in image and video: A comparative evaluation
Improving speed-accuracy trade-off in face detectors for forensic tools by image resizing
Textile retrieval based on image content from cdc and webcam cameras in indoor environments
Distributed representations of words and phrases and their compositionality
Glove: Global vectors for word representation
An active learning based on uncertainty and density method for positive and unlabeled data
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Metadata-based detection of child sexual abuse material
Alaiz-Rodríguez, File name classification approach to identify child sexual abuse
Improving named entity recognition in noisy user-generated text with local distance neighbor feature
Learning orthographic features in bi-directional lstm for biomedical named entity recognition
A hybrid rule/model-based finite-state framework for normalizing sms messages
icop: Automatically identifying new child abuse media in p2p networks
Character-level convolutional networks for text classification
Character-aware neural language models
Short text classification in twitter to improve information filtering
Short text classification using very few words
Feature selection via maximizing global information gain for text classification
News classification based on their headlines: A review
Term weighting scheme for short-text classification: Twitter corpuses
tax2vec: Constructing interpretable features from taxonomies for short text classification
A rough set-based approach to text classification
Some effective techniques for naive bayes text classification
Text categorization with support vector machines: Learning with many relevant features
A brief survey of text mining
A survey of decision tree classifier methodology
Twitter's doubling of character count from 140 to 280 had little impact on length of tweets, TechCrunch
Cross-language domain adaptation for classifying crisis-related short messages
Random forests
Verbal aggression detection on twitter comments: Convolutional neural network for short-text sentiment analysis
Sequential short-text classification with recurrent and convolutional neural networks
Investigating classification supervised learning approaches for the identification of critical patients' posts in a healthcare social network
Online url classification for large-scale streaming environments
A novel lightweight url phishing detection system using svm and similarity index
A naive bayes approach for url classification with supervised feature selection and rejection framework
Robust url classification with generative adversarial networks
Machine learning based phishing detection from urls
Malicious url classification using machine learning algorithms and comparative analysis
An information-theoretic perspective of tf-idf measures
Harnessing the power of text mining for the detection of abusive content in social media
Cross-lingual language model pretraining
Generalized autoregressive pretraining for language understanding

This work was supported by the framework agreement between the University of León and INCIBE (Spanish National Cybersecurity Institute) under Addendum 01. This research has been funded with support from the European Commission under the 4NSEEK project with Grant Agreement 821966.
This publication reflects the views only of the author, and the European Commission cannot be held responsible for any use which may be made of the information contained therein.