title: Metadata-Based Detection of Child Sexual Abuse Material
authors: Pereira, Mayana; Dodhia, Rahul; Anderson, Hyrum; Brown, Richard
date: 2020-10-05

Abstract: Child Sexual Abuse Media (CSAM) is any visual record of a sexually-explicit activity involving minors. CSAM impacts victims differently from the actual abuse because the distribution never ends, and images are permanent. Machine learning-based solutions can help law enforcement quickly identify CSAM and block its digital distribution. However, collecting CSAM imagery to train machine learning models has many ethical and legal constraints, creating a barrier to research development. With such restrictions in place, the development of CSAM machine learning detection systems based on file metadata uncovers several opportunities. Metadata is not a record of a crime, and it does not have legal restrictions. Therefore, investing in detection systems based on metadata can increase the rate of discovery of CSAM and help thousands of victims. We propose a framework for training and evaluating deployment-ready machine learning models for CSAM identification. Our framework provides guidelines for evaluating CSAM detection models against intelligent adversaries and for measuring model performance on open data. We apply the proposed framework to the problem of CSAM detection based on file paths. In our experiments, the best-performing model is based on convolutional neural networks and achieves an accuracy of 0.97. Our evaluation shows that the CNN model is robust against offenders actively trying to evade detection, as shown by evaluating the model on adversarially modified data. Experiments with open datasets confirm that the model generalizes well and is deployment-ready.

International law enforcement handles millions of child sexual abuse cases annually. In 2017, child abuse hotlines received and reviewed over 37 million child sexual abuse files worldwide [5]. Despite the PROTECT Our Children Act of 2008 [19], the quantity of CSAM on digital platforms has grown dramatically over the last decade [29]. Online sharing platforms and social media have facilitated [23] the explosive growth of CSAM creation and distribution [6]. Every platform for content searching and sharing, including social media, likely has CSAM on it [14]. Recent studies have reported alarming statistics: 25% of girls and 17% of boys in the U.S. have experienced some form of sexual abuse before the age of 18 [26].

As the scale of the problem grows, technology plays an essential role in CSAM identification. Companies that manage user-generated data, such as Pinterest, Facebook, Microsoft, Apple, and Google, have made detection and removal of CSAM a top priority. Although several non-profit organizations, such as Project VIC International (https://www.projectvic.org), Thorn (https://www.thorn.org), and the Internet Watch Foundation (https://www.iwf.org.uk), focus on building tools to combat CSAM proliferation, the creation and distribution of CSAM is still a growing problem. The COVID-19 pandemic triggered a significant increase in the distribution of CSAM via social media and video conferencing apps [24].

The identification of CSAM is a highly challenging problem. First, it can manifest in different types of material: images, videos, streaming, video conferences, online gaming, among others.
Undiscovered and unlabeled CSAM on the internet is estimated to be orders of magnitude greater than the currently identified CSAM. Second, the discovery of new material still depends heavily on human reviewers. Despite the significant progress in machine learning models for CSAM identification with modern deep-learning architectures [15], [27], [32], these models rely on the availability of labeled images, which can lead to technical limitations. As new material is created daily, we understand that utilizing complementary signals can advance the capability of digital platforms to detect and remove illegal content. The use of metadata has been proposed in prior work [22]. This is an effective approach since distributors use coded language to communicate and trade links to CSAM hosted in plain sight on content sharing platforms such as Facebook, YouTube, Twitter, and Instagram.

The scarcity of frameworks for training and evaluating deployment-ready machine learning models for CSAM detection prevents a broader adoption of machine learning pipelines for CSAM detection. Before deployment, organizations should test the CSAM detection model under different conditions. An evaluation scenario needs a real-world dataset with a data distribution similar to what the model will be exposed to after deployment. A critical scenario for analysis is testing the model on completely benign out-of-sample datasets. The burden caused by a high false-positive rate can result in halting the deployment of such systems. Furthermore, it is crucial to understand how adversarially modified data samples impact model performance.

Building machine learning systems for detecting CSAM media is a complex task. Due to the associated legal constraints, systems that rely on metadata for detecting and blocking the distribution of CSAM can expedite the hard work of NGOs and content moderators. This work proposes a framework for training and evaluation of deployment-ready machine learning models for CSAM detection based solely on file metadata. The proposed models compute the likelihood that a file path is associated with CSAM. Our experiments show that the resulting model not only does not rely on the availability of CSAM images and videos for model training but also achieves the desired performance in challenging real-world deployment scenarios. We list our contributions as follows:

• We propose a training and evaluation framework to develop deployment-ready machine learning models for CSAM text classification tasks, including file path classification, website keyword classification, search term classification, and others. Our framework includes a model training pipeline that covers several text representation and machine learning techniques. Our testing pipeline, illustrated in Figure 1, covers real-world scenarios that should be expected when deploying a machine learning model for CSAM detection: (i) tests on out-of-sample CSAM and non-CSAM samples; (ii) tests on CSAM samples adversarially modified to evade detection; (iii) tests on benign samples from open data sources.

• We train and compare several machine learning models that analyze file paths and file names from file storage systems and determine the probability that a given file has child sexual abuse content. We train our models on a real-world dataset containing over one million file paths. It is the most extensive file path dataset ever used for CSAM detection to date.
Our best classifier achieves recall over 0.94 and accuracy over 0.97 on holdout sets; it maintains a high recall rate on adversarially modified inputs; and, when tested against benign samples from other data distributions, it achieves a false-positive rate of ≈ 0.01.

To our knowledge, our work is the first to propose a framework for training and evaluation of deployment-ready CSAM detection systems that includes adversarial examples in the evaluation stage. Our results show that machine learning based on file paths can effectively detect CSAM in storage systems and achieve the desired performance in all the proposed evaluation scenarios.

Identification of CSAM via statistical algorithms is a reasonably recent approach. In the early 2000s, the US and the UK introduced laws targeting the online exploitation of minors (COPA in the US, the Crime and Disorder Act in the UK) [7]. However, only in 2008 was the first widely used technology for CSAM identification released.

PhotoDNA Hash. PhotoDNA (PDNA) is a widely used technique for automated identification of CSAM. PDNA uses a fuzzy hash algorithm to convert a CSAM image into a long string of characters. The converted hashes are compared against other hashes to find identical or similar images. PDNA technology enabled faster discovery of CSAM while protecting the victims' identities. This system is still one of the most widely used methods for detecting CSAM images worldwide. Search engines, social networks, and image sharing services utilize databases of hashed CSAM images to eradicate harmful content from their platforms. PDNA is a signature-based technology; it recalls only known CSAM. Therefore, identifying new CSAM in a PDNA-based system requires manual labeling.

Machine Learning for Image Identification. Since PDNA's first development, computer vision models have undergone a revolution, resulting in novel machine learning-based models for pornography and CSAM detection [15], [18], [21], [32]. Current approaches either combine a computer vision model to extract image descriptors [27], train computer vision models on pornography data [9], perform a combination of age estimation and pornography detection [15], or use synthetic data [32]. However, due to legal restrictions on maintaining a database of CSAM images, all current works are either based on unrealistic images [32] or validated by authorities on small datasets [9], [15], [27] that hardly represent the true data distribution on the internet [6].

Adversarially Modified Data Samples. Adversarial inputs are intentionally crafted small perturbations to malicious inputs that elude detection by a model. For text applications, this can include injecting random noise that does not dramatically alter a human's understanding: substitutions such as replacing "before" with "b4", homoglyph substitutions, or using "Lo7ita" instead of "Lolita" [31]. The effects of adversarial modifications on text have been explored for different NLP techniques, including classification [1], machine translation [3], and word embeddings [11]. Depending on what kind of information is available, the adversary distorts the portions of the text most likely to contain a signal important to the classification task.

CSAM File Metadata Classification. While significant efforts have focused on the images themselves, some researchers have looked for complementary signals to help CSAM identification.
Such measures include queries that return CSAM in search engines, file metadata, and conversations that imply grooming or the exchange of CSAM [25]. Other efforts have used textual signals to identify where CSAM might be located, such as keywords related to website content [28], NLP analysis [2], [20], [22], and conversations [4]. Our work falls into this category. Previous works have found that perpetrators tend to use a specific CSAM vocabulary to name files [22]. For this reason, using file paths, which are the combination of the file location and the file name, is a promising approach for CSAM identification. To the best of our knowledge, there is one related work that aims to identify CSAM based solely on file paths [2]. However, it does not address important questions such as classifier robustness against adversarial examples and performance on out-of-sample benign datasets.

In this section we provide an overview of the methods and algorithms utilized in our experiments. We present two different concepts utilized for text vectorization: term-frequency inverse-document-frequency (TF-IDF) and character-based quantization.

TF-IDF. This technique attributes weights to words (or sequences of characters) in a text [13]. First it computes the term frequency (TF), which is the number of times a term occurs in a given document. The inverse document frequency (IDF) component is computed as

idf(t) = log(n / df(t)),

where n is the total number of documents in the document set and df(t) is the number of documents in the document set that contain term t. For each term, the product of the TF and IDF components is computed. The resulting TF-IDF vectors are then normalized by the Euclidean norm.

Character quantization. This type of text preprocessing starts by defining an alphabet of size m as the input language, quantized using 1-of-m encoding. Each textual input of length l is then transformed into a sequence of such m-sized vectors with fixed length l. Texts with more than l characters are truncated, and the exceeding initial characters are discarded. If the text is shorter than l, it is padded with zeroes on the left. Characters that are not in the alphabet are quantized as all-zero vectors.
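As an illustration of the two TF-IDF variants (word-level and character n-gram), the following minimal sketch uses scikit-learn's TfidfVectorizer; the library choice, the toy paths, the token pattern, and the feature cut-offs are assumptions made here for illustration rather than details taken from the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy, benign file paths standing in for the restricted training data.
paths = [
    "C:/Users/alice/Pictures/holiday/beach_2019.jpg",
    "/home/bob/videos/family/birthday_party.mp4",
    "D:/backups/documents/tax_return_2020.pdf",
]

# Word-level TF-IDF: tokens are runs of alphanumeric characters, so dashes,
# slashes, colons, underscores, and periods act as separators.
word_tfidf = TfidfVectorizer(
    analyzer="word",
    token_pattern=r"[A-Za-z0-9]+",
    max_features=5000,   # keep only the most frequent words
    norm="l2",           # Euclidean normalization of each TF-IDF vector
)
X_words = word_tfidf.fit_transform(paths)

# Character n-gram TF-IDF for n = 1, 2, 3.
char_tfidf = TfidfVectorizer(
    analyzer="char",
    ngram_range=(1, 3),
    max_features=50000,
    norm="l2",
)
X_ngrams = char_tfidf.fit_transform(paths)

print(X_words.shape, X_ngrams.shape)
```

The fitted vectorizers produce the sparse feature matrices consumed by the traditional classifiers described next.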
We use several learning algorithms that have been successfully applied to short text classification. We consider two broad categories of approaches: i) traditional machine learning models, and ii) neural network models.

Logistic Regression. This classification algorithm is a discriminative classifier that models the posterior probability P(Y | X) of the class Y given the input features X by fitting a logistic curve to the relationship between X and Y. Model outputs can be interpreted as probabilities of the occurrence of a class [17].

Naive Bayes. A conditional probability model that assumes independence of the features: given a problem instance to be classified, represented by a vector x = (x_1, ..., x_n) of n features, it assigns to this instance probabilities P(C_k | x_1, ..., x_n) for each of K possible outcomes or classes C_k. The problem with this formulation is that if the number of features n is large, or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible. The model is therefore reformulated to become more tractable. Using Bayes' theorem and assuming conditional independence of the features, the conditional probability can be decomposed as

P(C_k | x_1, ..., x_n) ∝ P(C_k) ∏_{i=1}^{n} P(x_i | C_k).

Boosted Decision Trees. A model based on an ensemble of trees, where the ensemble is trained using a boosting process in which each subsequent tree is built with higher weights on the instances misclassified by the previous trees [8]. Classification of a new instance with a trained ensemble of trees is based on a simple majority vote of the individual trees.

Convolutional Neural Networks. One-dimensional Convolutional Neural Networks (CNNs) are a good fit when the input is text treated as a raw signal at the character level [33]. CNNs automatically learn filters to detect patterns that are important for prediction. The presence (or absence) of these patterns is then used by the quintessential neural network (a multilayer perceptron) to make predictions. These filters (also called kernels) are learned during backpropagation.

Long Short-Term Memory Network. This flexible architecture generalizes manual feature extraction via n-grams, for example, but instead learns dependencies of one or multiple characters, whether in succession or with arbitrary separation. The long short-term memory (LSTM) layer can be thought of as performing implicit feature extraction, as opposed to the explicit feature extraction (e.g., n-grams) used in other approaches. Rather than representing file paths explicitly as a bag of n-grams, the LSTM learns patterns of tokens that maximize the performance of the subsequent classification layer.

In this section we describe the training dataset and provide a detailed description of our file path classifiers. Our supervised learning approach to identifying CSAM file paths utilizes a binary labeled dataset (CSAM versus non-CSAM). To separate the dataset into independent training and test sets, we split the data by storage system information (e.g., drive letter designations) so that information does not leak from the training set to the test set. Our dataset consists of real file paths collected by Project VIC International. The data consists of 1,010,000 file paths from 55,312 unique storage systems.

File paths are strings that contain the location information of a file (folders) in a storage system and the file name. In Table 1 we present details on the different types of content that constitute the dataset and the number of samples for each type of content. We map the multiple classes provided by Project VIC into binary labels (CSAM vs. non-CSAM). Labels 1, 2 and 3 are mapped to CSAM (292,552 file paths); label 0 is mapped to non-CSAM (717,448 file paths).

File Path Characteristics. The distribution of file path lengths helps us define the size of the character input to our deep neural network models. Figure 2 shows the distribution of file path lengths in the dataset. Only 4,685 file paths have more than 300 characters. Our model takes as input vectors of at most 300 characters. For file paths with more than 300 characters, we truncate the file path by discarding the initial characters and keeping only the last 300 characters. For file paths with fewer than 300 characters, we pad with zeros on the left.
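A minimal sketch of this character-level preprocessing: keep the last 300 characters, left-pad shorter paths, and map each character to an index in an alphabet built from the training data, with out-of-alphabet characters mapped to the padding index. The helper names are hypothetical; in the neural network models an embedding layer would turn these indices into dense vectors.

```python
import numpy as np

MAX_LEN = 300  # fixed input length used by the neural network models

def build_alphabet(training_paths):
    """Map every unique character seen in training to a positive integer index.
    Index 0 is reserved for padding and out-of-alphabet characters."""
    chars = sorted(set("".join(training_paths)))
    return {c: i + 1 for i, c in enumerate(chars)}

def quantize(path, alphabet, max_len=MAX_LEN):
    """Encode a file path as a fixed-length sequence of character indices."""
    path = path[-max_len:]                        # keep only the last max_len characters
    encoded = [alphabet.get(c, 0) for c in path]  # unknown characters map to 0
    padding = [0] * (max_len - len(encoded))      # pad with zeros on the left
    return np.array(padding + encoded, dtype=np.int64)

train_paths = ["C:/Users/alice/Pictures/beach.jpg", "/home/bob/videos/party.mp4"]
alphabet = build_alphabet(train_paths)
x = quantize("/home/bob/videos/party.mp4", alphabet)
print(x.shape)  # (300,)
```

Downstream, such index sequences would feed the embedding layer of the charCNN and charLSTM models described later.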
Cross-Validation Data Split. We use a K-fold cross-validation methodology in our experiments with K = 10. To guarantee the independence of file paths in the different partitions of the data, we create the random data folds by splitting the data by storage system information, as illustrated in Figure 3. The information before the first backslash of a file path specifies the external storage system or a laptop/desktop. This information is used to partition the dataset for cross-validation.
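A minimal sketch of this grouped split using scikit-learn's GroupKFold, with the text before the first path separator standing in for the storage system identifier; the library, the prefix rule, and the toy data are illustrative assumptions, not details from the paper.

```python
import re
from sklearn.model_selection import GroupKFold

paths = [
    "C:\\Users\\alice\\Pictures\\beach.jpg",
    "C:\\Users\\alice\\Videos\\party.mp4",
    "D:\\backups\\docs\\report.pdf",
    "/home/bob/music/song.mp3",
]
labels = [0, 0, 0, 0]  # toy labels; the real data is binary CSAM / non-CSAM

def storage_id(path):
    """Use the text before the first path separator as the storage-system group key."""
    prefix = re.split(r"[\\/]", path, maxsplit=1)[0]
    return prefix or path  # Unix-style paths start with "/", so fall back to the full path

groups = [storage_id(p) for p in paths]

# The paper uses K = 10 on the full dataset; the toy data only supports 3 groups.
gkf = GroupKFold(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(gkf.split(paths, labels, groups)):
    train_groups = {groups[i] for i in train_idx}
    test_groups = {groups[i] for i in test_idx}
    assert train_groups.isdisjoint(test_groups)  # no storage system spans both sides
    print(fold, sorted(test_groups))
```

Running the loop confirms that no storage system contributes file paths to both the training and test folds.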
Our work investigates three approaches for CSAM file path classification:
1) Bag-of-words: the file path string is encoded into a vector of words. The weights of the words are attributed using TF-IDF. We utilize the resulting vectors as input to traditional machine learning classifiers (logistic regression, boosted decision trees, and naive Bayes).
2) Character N-grams: a list of character sequences of size N encodes the file path. The weights of the sequences are attributed using TF-IDF. The resulting vectors of character sequences are used with traditional machine learning classifiers (logistic regression, boosted decision trees, and naive Bayes).
3) Character quantization: sequences of encoded characters are used with a convolutional neural network (CNN) and a long short-term memory network (LSTM).

Bag-of-Words. For each file path, we consider a word to be a sequence of alphanumeric characters separated by a dash, slash, colon, underscore, or period. The bag-of-words model is constructed by selecting the 5,000 most frequent words from the training subset. We utilize this text representation in combination with TF-IDF. The dataset of vectorized file paths is used as input to three different learning algorithms: logistic regression, naive Bayes, and boosted decision trees.

Bag-of-Ngrams. We extract from each file path string its n-grams, for n ∈ {1, 2, 3}. The set of n-grams of a string s is the set of all substrings of s of length n. We construct the bag-of-ngrams models by selecting the 50,000 most frequent n-grams (up to 3-grams) from the training dataset. We utilize this text representation in combination with TF-IDF. The dataset of vectorized file paths is used as input to three different learning algorithms: logistic regression, naive Bayes, and boosted decision trees.

Neural Networks. The alphabet used in all of our models consists of m = 802 characters, including English letters, Japanese characters, Chinese characters, Korean characters, and special alphanumeric characters. The alphabet is the set of all unique characters in the training data. All neural network architectures start with an embedding layer that represents each character by a numerical vector. The embedding maps semantically similar characters to similar vectors, where the notion of similarity is automatically learned based on the classification task at hand. The variant of the LSTM architecture used in our work is the common "vanilla" architecture as used in [30]. Figures 4 and 5 show detailed information on both architectures (charCNN and charLSTM), including data dimensions and the number of weights in each layer. Throughout this paper, we refer to our CNN-based neural network and our LSTM-based neural network as charCNN and charLSTM, respectively.

We present our results for all our classifiers in Table 2. All performance metrics were measured using 10-fold cross-validation on the Project VIC dataset. For each of our classifiers, we report the mean and the standard deviation over the folds for the area under the ROC curve (AUC), accuracy, precision, and recall for predicting CSAM files. We focus on two primary metrics for model comparison: recall and AUC. Additionally, we assess the generalization of all machine learning models by looking at the standard deviations over the cross-validation folds.

Traditional Machine Learning Models. There are significant advantages of traditional machine learning models in comparison with deep neural networks. Understanding how well these models perform can help scientists and investigators leverage such models' most remarkable characteristic: feature interpretability. The most relevant predictive tokens, or n-grams, can give clues about the vocabulary used in the dataset, which can be utilized in other CSAM detection systems. In Table 2, we observe that the models trained with bag-of-words and bag-of-ngrams features operate in similar AUC and accuracy ranges. When analyzing the recall rates of the traditional models, we note that both naive Bayes models have the highest average rates and the lowest standard deviations. Naive Bayes with bag-of-ngrams features presents the best recall of all traditional models, of about 0.91. Among the models trained using bag-of-ngrams, naive Bayes presents a much smaller recall standard deviation (σ = 0.085) than logistic regression (σ = 0.20) and boosted decision trees (σ = 0.20).

Although the evaluation of CSAM classification models relies heavily on recall rates, when deploying a model in an environment that potentially analyzes hundreds of thousands of file systems, and consequently millions of file paths, precision can become the most significant metric. The burden of several thousand false positives can result in an inefficient process and potentially delay investigations and the discovery of true positives. The AUC metric captures the ability of a classifier to operate with high recall when low false positive rates are necessary. By analyzing the traditional models' AUC, we observe that boosted decision trees overall perform better than the other two techniques.

Neural Network Models. We achieved the best performance across all categories with a deep neural network architecture. We trained two different architectures: a layered CNN and an LSTM. The LSTM model achieves results very similar to the bag-of-words naive Bayes classifier. However, our CNN model consistently outperforms all the other models, both in mean performance metrics across all folds and in having the lowest standard deviation. The recall of over 0.94 and precision over 0.93 make this model an excellent candidate as an investigative tool in environments with large volumes of storage systems.

Fig. 3: Data split for K-fold cross-validation. The data is partitioned by storage system ids. File paths from the same storage system are all assigned to the same data fold. The data folds in the orange shaded areas represent the training data in each iteration, and the data folds in the blue shaded areas represent the test data.

Table 2: Experiments with traditional machine learning and neural networks using Project VIC's dataset. We evaluate the AUC-ROC, accuracy, precision, and recall. These results were measured across 10 folds in a cross-validation setting. For each metric, we report the mean (µ) and the standard deviation (σ).

In classical machine learning applications, we assume that the underlying data distribution is stationary at test time. However, a testing pipeline for models aimed at the detection of illegal activities should anticipate an intelligent, adaptive adversary actively manipulating data. We know perpetrators purposely add typos and modifications to file identifiers [22] to evade blocklists and machine learning-based detection mechanisms. We modify our test dataset to simulate an adversary actively changing the file paths to elude the classifiers.
In our testing scenario, the adversary can send file paths to the model but does not have access to the model outputs. The only action allowed to the adversary is to make changes to the file paths. Although the adversary is willing to change the file paths to evade detection, it is not interested in completely changing the file path's words (and meaning). The files are often shared among perpetrators, and the file name is usually used to identify file contents. Therefore, the adversary wants to make the maximum number of changes without compromising human comprehension of the meaning of the string of characters. The number of modifications to the file path is called the adversary's budget. Given a classifier F : X → Y, which maps the input space X to the set of labels Y, a valid adversarial example x_adv is generated by altering the original data example x ∈ X and should conform to the following requirement: S(x_adv, x) ≤ ε, where S : X × X → R+ is a similarity function and ε ∈ R+ is a constant modeling the budget available to the adversary, i.e., the allowable size of a modification.

Threat Model. In our threat model, the attacker is not aware of the model architecture or parameters and does not have access to the confidence scores predicted by the model. The attacker attempts to cause an integrity violation in the model by modifying the input under a bounded perturbation size: a hard-label black-box evasion scenario [10]. The only knowledge the adversary has about the model is the input space X and the output space Y. In our experiments we modify the test dataset based on two variations of the threat model, shown in Figure 6.

Attack 1: Random Substitutions. The adversarial examples are generated by randomly selecting a position in the file path string and substituting the character in the selected position with a random alphanumeric character. This technique has been previously used to attack language models [3]. We evaluate our models for adversarial budgets of 10%, 15%, and 20% of the file path length.

Attack 2: CSAM Lexicon Substitution. We assume the adversary has access to a CSAM lexicon, a list of words used by perpetrators to mark media containing CSAM. The adversary uses the CSAM lexicon to identify words in the file path that indicate the presence of CSAM. As illustrated in Figure 6, in this attack the adversary first locates a word from the lexicon in the file path and then substitutes characters. The file path modification occurs as follows: the adversary verifies which words from the CSAM lexicon are present in the file path. For every such word, the adversary randomly modifies one character by randomly selecting a position in the word string and substituting the character in the chosen position with a random character. The number of allowed word replacements is determined by the budget ε.

For our experiments, we create the CSAM lexicon using the Odds Ratio, a widely used technique in information retrieval, also used for feature selection and interpretation of text classification models [16]. First, we identify which tokens are more likely to appear in CSAM file paths. We extract all tokens from the dataset of file paths as described in Section 4.2.1. For every keyword, we calculate the odds of the keyword being part of a CSAM file path and the odds of the keyword being part of a non-CSAM file path. The Odds Ratio of a word w is computed as

OddsRatio(w) = (odds of w appearing in a CSAM file path) / (odds of w appearing in a non-CSAM file path).

The CSAM lexicon comprises all keywords with an Odds Ratio greater than two.
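A minimal sketch of the two attacks and of an odds-ratio-based lexicon construction. The tokenization rule, the add-one smoothing, and the helper names are simplifying assumptions made for illustration; the paper does not specify these implementation details.

```python
import random
import re
import string
from collections import Counter

ALPHANUM = string.ascii_letters + string.digits

def random_substitution(path, budget_fraction):
    """Attack 1: replace a random fraction of characters with random alphanumerics."""
    if not path:
        return path
    chars = list(path)
    n_edits = max(1, int(len(chars) * budget_fraction))
    for pos in random.sample(range(len(chars)), n_edits):
        chars[pos] = random.choice(ALPHANUM)
    return "".join(chars)

def build_lexicon(csam_paths, benign_paths, min_odds_ratio=2.0):
    """Keep keywords whose (smoothed) odds of appearing in a CSAM path are at
    least min_odds_ratio times their odds of appearing in a non-CSAM path."""
    tokens = lambda p: set(re.findall(r"[A-Za-z0-9]+", p.lower()))
    pos, neg = Counter(), Counter()
    for p in csam_paths:
        pos.update(tokens(p))
    for p in benign_paths:
        neg.update(tokens(p))
    lexicon = {}
    for w in pos:
        p1 = (pos[w] + 1) / (len(csam_paths) + 2)    # smoothed P(w in path | CSAM)
        p0 = (neg[w] + 1) / (len(benign_paths) + 2)  # smoothed P(w in path | non-CSAM)
        odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))
        if odds_ratio >= min_odds_ratio:
            lexicon[w] = odds_ratio
    return lexicon

def lexicon_substitution(path, lexicon, budget):
    """Attack 2: perturb one random character in each of the `budget` highest
    odds-ratio lexicon words that occur in the path."""
    present = [w for w in sorted(lexicon, key=lexicon.get, reverse=True)
               if w in path.lower()][:budget]
    for w in present:
        start = path.lower().find(w)
        pos = start + random.randrange(len(w))
        path = path[:pos] + random.choice(ALPHANUM) + path[pos + 1:]
    return path
```

Applying random_substitution with budget_fraction between 0.10 and 0.20, or lexicon_substitution with a budget of 1 to 4 words, to the CSAM paths of a test fold reproduces the two attack settings evaluated below.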
We make this list available to the adversary. The adversarial modifications are done on the test fold of the 10-fold cross-validation on the Project VIC dataset. We evaluate the impact of adversarial modifications of test samples on the models' performance. We are interested in i) understanding which machine learning techniques are more robust when the data is adversarially modified at test time, and ii) how much the performance of the models changes. All attacks target only CSAM file paths, and therefore we only evaluate the variation in recall rates. Additionally, we analyze the mean deviation in confidence scores for all models.

Random Substitutions. Under this scenario, an adversary randomly modifies a percentage of the file path by randomly selecting characters and replacing them with random characters. A reasonable adversary budget in this scenario is between 10% and 15%. Previous works have also considered this percentage range for perturbing text strings [12]. Considering that most file paths have a length between 40 and 200 characters, this results in changing between 6 and 30 characters in each file path. To stress-test our models, we also analyze their performance under a 20% change. Table 3 shows details on the confidence score variations for different adversarial budgets. Figure 7 shows the variation in recall rates as the percentage of randomly flipped characters increases. For an adversarial budget of 15%, we observe a decrease in recall rates of 0.02 for bag-of-words naive Bayes and 0.07 for the CNN model. Interestingly, bag-of-ngrams naive Bayes shows an increase in recall rates after flipping a percentage of the characters in the file path. This phenomenon results from the fact that modifications in the file paths can also increase the model's confidence score output. The boosted decision tree models undergo the most significant decrease in recall rates; possible overfitting to this specific dataset distribution can explain this. When flipping 20% of characters, the deep neural network models' recall rates decrease by ≈ 0.1, bag-of-words naive Bayes decreases by ≈ 0.05, and bag-of-ngrams naive Bayes once again presents an increase in its recall rate.

Table 3: Model Evaluation II - model confidence decrease under adversarial inputs. Experiment with the random substitution attack, for adversarial budgets of 10%, 15%, and 20%. We evaluate the mean decrease in model confidence (µ) and the variance of the confidence decrease (σ) for traditional machine learning algorithms - Logistic Regression (LR), Naive Bayes (NB), and Boosted Trees (BT) - and neural networks - Convolutional Neural Networks (CNN) and Long Short-Term Memory networks (LSTM). The text preprocessing methods used with the traditional machine learning algorithms are bag-of-words (BW) and bag-of-n-grams (BN). The text preprocessing method used with the neural network algorithms is character quantization (CQ).

Fig. 7: Model Evaluation II - recall variation as the adversary budget increases in a random substitution attack. Our best performing model, charCNN, suffers a decrease in recall as the adversary budget increases, yet recall is still over 0.80 even when 20% of the characters of a CSAM file path are modified. Surprisingly, the model trained using naive Bayes (with bag-of-words as the text preprocessing technique) shows an increase in recall as the adversary budget increases.
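The recall and mean confidence-decrease numbers reported in this evaluation can be computed with a routine like the sketch below, which assumes a fitted scikit-learn-style classifier exposing predict_proba and a vectorization function; both names are illustrative.

```python
import numpy as np

def recall_and_confidence_drop(model, vectorize, csam_paths, adv_paths, threshold=0.5):
    """Compare model behaviour on original vs. adversarially modified CSAM paths.

    Returns recall on the modified paths and the mean decrease in the model's
    CSAM confidence score caused by the modifications (negative if it rises).
    """
    p_orig = model.predict_proba(vectorize(csam_paths))[:, 1]
    p_adv = model.predict_proba(vectorize(adv_paths))[:, 1]
    recall = float(np.mean(p_adv >= threshold))  # every input is CSAM, so recall = TPR
    mean_drop = float(np.mean(p_orig - p_adv))
    return recall, mean_drop
```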
CSAM Lexicon Substitution. Having access to a list of terms highly correlated with CSAM file paths permits the adversary to make targeted changes to the CSAM file paths. This experiment allows the adversary to change one character per keyword, for up to 4 keywords per file path. Figure 8 illustrates the recall variation as a function of the adversarial budget. As indicated in Figure 8, the most significant drop in recall happens for budget ε = 1, which can be justified by the fact that the adversary also has access to the Odds Ratio for each keyword. For budget ε = 1, the adversary modifies the keyword with the largest Odds Ratio; for budget ε = 2, the adversary modifies the keywords with the two largest Odds Ratios, and so on. Overall, the targeted changes in the file paths result in small changes in the recall rates. Logistic regression and boosted decision trees have larger recall variations than the naive Bayes and deep neural network models. Confidence score variation details are described in Table 4. It is easy to observe that the mean change in confidence score is less than 0.1 for all models and all budgets. However, we see examples in our experiments where a single change resulted in a drastic drop in the model confidence. Without access to the model output, we conclude that it is hard to craft an adversarial example close to the original sample, even when an adversary has access to a list of keywords highly correlated with the positive class.

Table 4: Model Evaluation II - model confidence decrease under adversarial inputs. Experiment with the CSAM lexicon substitution attack, for adversarial budgets of 1, 2, 3, and 4 characters. We evaluate the mean decrease in model confidence (µ) and the variance of the confidence decrease (σ) for traditional machine learning algorithms - Logistic Regression (LR), Naive Bayes (NB), and Boosted Trees (BT) - and neural networks - Convolutional Neural Networks (CNN) and Long Short-Term Memory networks (LSTM). The text preprocessing methods used with the traditional machine learning algorithms are bag-of-words (BW) and bag-of-n-grams (BN). The text preprocessing method used with the neural network algorithms is character quantization (CQ).

A common measure of success for detection systems based on machine learning is the true positive rate, also known as recall. To guarantee a successful deployment of the detection system, another metric of extreme importance is the false positive rate. The usual training pipeline evaluates this metric on data distributions that are similar to the training data. However, when possible, an extremely valuable experiment is to collect data from different distributions to understand model performance under different conditions.
When a positive detection occurs, detection systems usually trigger an action, which is usually human data review. If the false positive rate is not well understood and the model's operating point is not correctly calibrated, it can burden content moderators and jeopardize the deployment of the system. The benign samples dataset was constructed using the publicly available Common Crawl dataset. We collected data from Common Crawl index CC-MAIN-2021-10. The WARC files utilized to construct our dataset are:

• Linux file paths: we parsed the first 200 WARCs (00000-00199, inclusive), which resulted in over 73k unique paths.

• Windows file paths: we parsed 11,821 WARCs (00000-12000, inclusive), which resulted in 32k unique paths.

We parsed the raw HTML, treating it as a Latin-encoded string. In each HTML document, we applied regular expressions to identify Windows and Linux file paths. After collecting the dataset using these expressions, we filtered Windows file names to exclude ":/". For Linux file names, we only keep the file paths that begin with /usr/, /home, /etc, /tmp, and /var.

The evaluation of model performance on independent datasets is essential to understand model generalization. Most importantly, CSAM file paths account for a small fraction of file paths on sharing platforms. We test our best-performing model against a dataset containing only benign file paths. We measured the false positive rate for the best-performing model, charCNN, at different confidence thresholds. As we can observe from Figure 9, for decision thresholds above 0.8 the false positive rate is low. For example, for Linux file paths, a decision threshold of 0.8 results in an FPR of ≈ 0.03, whereas a decision threshold of 0.95 results in an FPR of ≈ 0.001. For Windows file paths, a threshold of 0.8 prompts an FPR of less than 0.01, while a decision threshold of 0.95 leads to an FPR of less than 0.001. High decision thresholds are common design choices in detection systems, where only the high-confidence samples are flagged and sent for human review. Given the high AUC values for the character-based CNN, measured on Project VIC's dataset, it seems reasonable to believe that we can achieve good recall for thresholds above 0.8 and thus guarantee a low false-positive rate.
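A minimal sketch of the threshold sweep behind this false-positive analysis, assuming a scoring function for the trained model and a list of benign Common Crawl paths; the function names and thresholds below are illustrative.

```python
import numpy as np

def false_positive_rate(scores, threshold):
    """Fraction of benign paths whose CSAM score meets or exceeds the threshold."""
    return float(np.mean(np.asarray(scores) >= threshold))

def sweep_thresholds(score_fn, benign_paths, thresholds=(0.5, 0.8, 0.9, 0.95)):
    """Score a benign dataset once and report the FPR at several decision thresholds."""
    scores = score_fn(benign_paths)
    return {t: false_positive_rate(scores, t) for t in thresholds}
```

A sweep of this kind yields the FPR-versus-threshold behavior reported in Figure 9.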
In this paper, we proposed a training and evaluation framework to develop deployment-ready machine learning models for CSAM file detection. Our framework includes a model training pipeline that covers several techniques for text representation and machine learning. Our evaluation pipeline covers real-world scenarios that surface when deploying a machine learning model for CSAM detection. All testing scenarios are rigorously defined and easily reproducible. The proposed system for CSAM identification based solely on file paths has the advantage of not working directly with CSA photos or videos. The classifier is a medium-agnostic CSAM detector that is easy to maintain and faces reduced legal restrictions for acquiring training data. Our classifier achieves precision and recall rates over 0.90 on out-of-sample hard drives. Our experiments also show that our models generalize well to identifying CSAM content in file storage systems and preserve a low FPR on out-of-sample negative samples. Additionally, we present a testing framework to evaluate model robustness to adversarial attacks introduced at test time. The proposed framework is an essential addition to the available tools for CSAM detection.

The community can leverage the proposed framework to train and evaluate models for CSAM metadata and short-text classification tasks, such as file path classification and CSAM search term classification. In combination with PhotoDNA hashes, computer vision tools, and other forensic tools, our CSAM file path classifier integrates into a global toolset that enables organizations to fight the distribution of CSAM. Online child sexual abuse imagery falls into a category of content that should not be distributed or be present in file storage systems. The distributed nature of the internet makes CSAM detection a complex problem to solve. Automated tools and machine learning-based systems can help technology companies and investigation agencies rapidly identify such content and take appropriate action.

References
[1] How much noise is too much: A study in automatic text classification.
[2] File name classification approach to identify child sexual abuse.
[3] Synthetic and natural noise both break neural machine translation.
[4] Exploring high-level features for detecting cyberpedophilia. Computer Speech & Language.
[5] Project VIC: Helping to identify and rescue children from sexual exploitation.
[6] Rethinking the detection of child sexual abuse imagery on the internet.
[7] Internet Child Abuse: Current Research and Policy.
[8] A decision-theoretic generalization of on-line learning and an application to boosting.
[9] Pornography and child sexual abuse detection in image and video: A comparative evaluation.
[10] I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples.
[11] How robust are character-based word embeddings in tagging and MT against wrod scramlbing or randdm nouse?
[12] Textfool: Fool your model with natural adversarial text.
[13] A statistical interpretation of term specificity and its application in retrieval.
[14] The internet is overrun with images of child sexual abuse. What went wrong?
[15] A benchmark methodology for child pornography detection.
[16] Feature subset selection in text-learning.
[17] On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes.
[18] Pornographic image detection utilizing deep convolutional neural networks.
[19] S. 1738 - PROTECT Our Children Act of 2008.
[20] Detecting deceptive behaviour in the wild: Text mining for online child protection in the presence of noisy and adversarial social media communications.
[21] iCOP: Automatically identifying new child abuse media in P2P networks.
[22] iCOP: Live forensics to reveal previously unknown criminal media on P2P networks. Digital Investigation.
[23] Deterrence of Online Child Sexual Abuse and Exploitation.
[24] Child sexual abuse images and online exploitation surge during pandemic.
[25] Meet the new anti-grooming tool from Microsoft, Thorn, and our partners.
[26] Child safety online: Global challenges and strategies.
[27] A two-tier image representation approach to detecting child pornography.
[28] Comparing methods for detecting child exploitation content online.
[29] Measuring a year of child pornography trafficking by U.S. computers on a peer-to-peer network.
[30] Predicting domain generation algorithms with long short-term memory networks.
[31] Detecting homoglyph attacks with a Siamese neural network.
[32] On the detection of images containing child-pornographic material.
[33] Character-level convolutional networks for text classification.