key: cord-0617492-nq1i8k5n
authors: Khanna, Sameer
title: Conical Classification For Computationally Efficient One-Class Topic Determination
date: 2021-10-31
journal: nan
DOI: nan
sha: 8588986699d283043b7f46eec07f67176bff225f
doc_id: 617492
cord_uid: nq1i8k5n

As the Internet grows in size, so does the amount of text based information that exists. For many application spaces it is paramount to isolate and identify texts that relate to a particular topic. While one-class classification would be ideal for such analysis, there is a relative lack of research regarding efficient approaches with high predictive power. By noting that the range of documents we wish to identify can be represented as positive linear combinations of the Vector Space Model representing our text, we propose Conical classification, an approach that allows us to identify if a document is of a particular topic in a computationally efficient manner. We also propose Normal Exclusion, a modified version of Bi-Normal Separation that makes it more suitable within the one-class classification context. We show in our analysis that our approach not only has higher predictive power on our datasets, but is also faster to compute.

In the era of the rapid development of computers and the Internet, information on a wide range of topics is pervasive. The amount of text based data is ever increasing in size, magnitude, and variety. Whether it is for e-commerce (Xiao and Tong, 2021), clinical diagnosis determination (Le et al., 2021), or fake news detection (Ahmed et al., 2018), it is vital to have efficient mechanisms for topic classification in order to effectively parse and process text based media. Most of the research on topic classification uses these implementations within a binary classification or multi-class classification context (Trstenjak et al., 2014; Zhang et al., 2011; Kim and Gil, 2019; Kim et al., 2019a; Liu et al., 2018).
By comparison, there is a relative dearth of work discussing and proposing algorithms that can identify text on a particular subject from a variety of subjects in a One-Vs-All configuration, especially regarding how to use vector representations of documents with low computational costs. This is unfortunate, as one-class classification of text enables us to identify text of a particular form from a potentially inexhaustible set of topics. In such a setting, it would be arduous to identify all potential topics we may come across and extremely time-consuming to label enough data to train a model for multi-class classification.

In practice, the lack of research into one-class topic determination has led to subpar implementations for the sake of speed. One of the best examples of the ramifications of this lack of research focus is insider threat detection systems. Despite insider threat detection primarily working with log and textual information, the vast majority of published work on the subject does not utilize Natural Language Processing in its implementations (Wei et al., 2021; Tuor et al., 2017; Meng et al., 2018; Le et al., 2018; Le and Zincir-Heywood, 2019). Many that do simply sum over TF-IDF vectors before feeding the result as a feature into detection models (Chattopadhyay et al., 2018; Sajjanhar et al., 2019).

We aim to tackle these issues head on. Our contributions are as follows:
• We propose Normal Exclusion, a re-framing of Bi-Normal Separation enabling usage for one-class classification.
• We show that our approach, Conical Classification (CC), achieves optimal performance when compared to alternative one-class topic determination strategies.

With the intention of assessing the predictive power of one-class based text classification methods, Joffe et al. compared one-class support vector machines (OCSVM) to binary support vector machines (SVM) to identify specific phenotypes in breast cancer.
They found that OCSVM performed comparably to SVM in balanced dataset problem spaces and outperformed SVM in highly imbalanced datasets (Joffe et al., 2015). Zhuang et al. concur, citing the improved performance of switching from an SVM to an OCSVM approach for minority class classification. They use a general framework which first uses the minority class for training in the one-class classification stage, then incorporates data from the majority class to improve the generalization performance of the constructed classifier (Zhuang and Dai, 2006). ... (Manevitz and Yousef, 2001), and Seo utilizing an OCSVM to help classify images in a database using color and text content for content-based image retrieval (Seo, 2007).

Ensemble based methodologies have been used in practice as well. Hempstalk et al. utilized an ensemble-based approach using C4.5 decision trees with Laplace smoothing to isolate real target values from those of an artificial class (Hempstalk et al., 2008), validating performance on various UCI datasets as well as a custom typist dataset. Anderka et al. utilized a similar approach to detect text quality flaws, using a Random Forest as the base classifier instead (Anderka et al., 2011). Unfortunately, despite their higher memory and computation requirements, such approaches offer little performance benefit compared to the OCSVM; Hempstalk et al.'s results indicated that their ensemble approach was not demonstrably superior to the OCSVM approach.

While not traditional one-class classification algorithms, there is a set of classifiers that co-train using a set of positive labeled data as well as a set of unlabeled data for evaluation. Denis et al. developed the Positive Naive Bayes (PNB) classifier that works under this setting, using it successfully to classify documents in the 20-Newsgroup dataset (Denis et al., 2003).
One-class topic determination is a problem space where it is paramount to be computationally fast with low resources in order to process large numbers of documents in a short amount of time. This has traditionally excluded recent advancements in Natural Language Processing such as embeddings from the discussion, as these take significant amounts of computation time on the modest hardware such application spaces necessitate. As a result, very few publications are dedicated to assessing their application to the space. Ruff et al. propose Context Vector Data Description (CVDD) (Ruff et al., 2019), a textual anomaly detection algorithm that builds upon word embedding models to learn multiple sentence representations that capture multiple semantic contexts via the self-attention mechanism. Hu et al. extended uni-modal Support Vector Data Description (SVDD) to a multi-modal one, building Multi-modal Deep Support Vector Data Description (mSVDD) with multiple hyperspheres, enabling them to build better descriptions for target one-class data (Hu et al., 2021).

The methodology used to create the vector representations of documents can be just as important as the detection algorithm used. One main approach that has come about as a result is term frequency (TF) - inverse document frequency (IDF). TF-IDF is the product of two statistics: TF and IDF. TF, as its name suggests, refers to the normalized frequency $f$ of a word $w_j$ that appears in the given document $D$. Originally coined as term specificity by Jones (Jones, 1972), IDF provides a measure of how much information a word provides depending on how common the word is in a given corpus. TF-IDF has been successfully used for topic classification in a variety of scenarios, including social media (Lee et al., 2011), research analysis (Kim and Gil, 2019), and news discovery (Hakim et al., 2014). As a result, much research has been done on modifications to improve performance. Martineau et al. proposed Delta TF-IDF, which scales weights using word scores before classification and boasts a higher accuracy than standard TF-IDF (Martineau and Finin, 2009). Forman studies replacing TF-IDF with Bi-Normal Separation (BNS), which eliminates the need for fine-tuned feature selection and performs exceptionally well on short documents (Forman et al., 2003). Domeniconi et al. used a supervised variant to prevent the IDF term from affecting documents within the category under analysis, so that terms frequently appearing in said category are not penalized (Domeniconi et al., 2015). More recently, vector representations have been developed that use embeddings, such as BERT (Devlin et al., 2018) and GloVe (Pennington et al., 2014). Such embeddings allow words with similar meanings to have a similar representation, which has enabled the impressive performance of deep learning methods on complex and intricate natural language processing problem spaces.

BNS, which measures how much the probability of occurrence of a given word in the positive class differs from the probability of occurrence of that word in the negative class, has a couple of key benefits as a VSM metric: it is excellent at ranking words for automated feature selection filtering, it has the best performance in single metric VSM analyses, and it is consistently a member of the optimal pairs of VSM metrics Forman et al. evaluated (Forman, 2008). Thus, being able to utilize BNS within a one-class context would be ideal. The formula used to calculate BNS is given in Equation 1. Here, $tpr$ is the true positive rate $P(\text{word} \mid \text{positive class})$ as determined via $tp/pos$, where $tp$ is the number of positive training cases containing the word and $pos$ is the number of positive training cases.
Likewise, $fpr$ refers to the false positive rate $P(\text{word} \mid \text{negative class})$ as determined via $fp/neg$, where $fp$ is the number of negative training cases containing the word and $neg$ is the number of negative training cases. $F^{-1}$ is the inverse Normal cumulative distribution function. $\epsilon$ is a number with small magnitude added to avoid the undefined scenario of $F^{-1}(0)$; for the purposes of our analysis, we set $\epsilon$ to 0.0005, or half a count out of 1000.

A naive translation to a one-class regimen would be to merely remove BNS's dependence on the $fpr$ term. Thus, each word would be scaled in relation to its frequency of occurrence within our positive training set. This leads to issues, as words with a naturally high occurrence in language such as the, be, to, of, a, etc. will have predominantly high scaled values. One may try to work around these effects by removing stopwords and unrelated words from the corpus, but this can require significant hand-tuning by an expert in the field while increasing overhead computation costs.

We propose an alternative solution that takes advantage of the nature of one-class classification, recalling that we wish to be able to identify text of a particular topic from any assortment of topics possible in the language. We simply need to estimate the $fpr$ of the word with the frequency of the word in our given language. For English, there are large corpora from which we can extract this information. For example, the Oxford English Corpus (OEC) is a dataset that presents all types of English, from blogs to newspaper articles to literary novels and even social media, drawing on varieties of English from the United Kingdom, the United States, Ireland, Australia, New Zealand, the Caribbean, Canada, India, Singapore, and South Africa. For our purposes, we compiled the frequencies of the top 1/3 million words in the English language using Tatman's English word count dataset (Tatman, 2017; Brants and Franz, 2006) and stored them within a dictionary for rapid lookup.
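The scaling described above can be illustrated with a short Python sketch. In Forman's formulation, BNS is $|F^{-1}(tpr) - F^{-1}(fpr)|$; the sketch below computes that quantity and then swaps the false positive rate for a language-wide word frequency, as proposed in the text. The function names, the `language_freq` mapping, and the example frequencies are hypothetical, and clipping both rates to $[\epsilon, 1 - \epsilon]$ is an assumption made here so that rates of exactly 0 or 1 stay inside the domain of the inverse CDF; it is not necessarily how the author applies $\epsilon$.

```python
from statistics import NormalDist

# Inverse CDF of the standard Normal distribution (F^{-1} in the text).
_INV_CDF = NormalDist().inv_cdf

EPSILON = 0.0005  # value stated in the text


def _clip(rate, eps=EPSILON):
    """Keep a rate inside [eps, 1 - eps] (assumption of this sketch)."""
    return min(max(rate, eps), 1.0 - eps)


def bns(tpr, fpr, eps=EPSILON):
    """Bi-Normal Separation: |F^{-1}(tpr) - F^{-1}(fpr)|."""
    return abs(_INV_CDF(_clip(tpr, eps)) - _INV_CDF(_clip(fpr, eps)))


def bns_with_language_prior(word, tp, pos, language_freq, eps=EPSILON):
    """One-class variant: estimate fpr from a word-frequency dictionary.

    tp  -- number of positive training documents containing the word
    pos -- number of positive training documents
    language_freq -- hypothetical dict mapping word -> relative frequency
    """
    tpr = tp / pos
    # Words missing from the dictionary are treated as having frequency 0.
    fpr_estimate = language_freq.get(word, 0.0)
    return bns(tpr, fpr_estimate, eps)


if __name__ == "__main__":
    freq = {"the": 0.9}  # made-up value, for illustration only
    print(bns_with_language_prior("the", 90, 100, freq))
    print(bns_with_language_prior("vaccine", 90, 100, freq))
```

With these made-up numbers, a common word that appears in most positive documents receives a weight near zero, while a topic-specific word with the same document count receives a large weight, which is the behaviour the dictionary-based estimate is meant to produce.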
We can safely set the frequencies of words that do not appear in our dictionary to 0, as these are words that rarely appear in standard language; such words include abaptiston, abaxile, grithbreach, gurhofite, zarnich, and zeagonite. Indeed, according to Oxford's compiled statistics, the combined frequency of occurrence for all such words is approximately one percent of the entire lexicon of the English language, easily within the margin of error for our analysis (Oxford, 2011).

We coin our tweaked formula Normal Exclusion (NE), as it excludes, or reduces, the weightage of words that are inconsequential to determining the topic of text without requiring a negative corpus to be present. The formula for NE is shown in Equation 2. Here, Dict[word] represents the frequency value for the given word as found within our dictionary. We scale NE by TF to obtain the NE-TF VSM used by our model. Our representation of a word in a model will thus be determined by how frequently a word occurs in our corpus, scaled by the statistical significance of the word within the evaluated text. Higher magnitude values give a strong indication that the vector is about our target topic, while lower values lead to a lower confidence that such a conclusion is correct.

VSM is based on the notion of vector similarity; the model assumes that the relevance of a document to another document is roughly equal to the document-query similarity. Under this model, the documents are represented using the bag-of-words approach. This means that documents are translated to n-dimensional vectors, where each dimension corresponds to a word based on a compiled set of terms known as a vocabulary. Under such models, we map a given topic to a certain subset of the compiled vocabulary. It is not enough, however, for a document to have a high frequency of words included within the subset to be classified as a given topic. Combinations of words are vital to the classification process.
For a timely example, a news article regarding COVID-19 and an administration protocol manual on COVID-19 vaccines will both strongly correlate to words such as vaccines, dosages, Pfizer, Moderna, among others. To distinguish between these two topics, we would need contextual words such as policy, mandate, and president to identify a news article, while words like intramuscular, angle, deltoid, and subcutaneous would likely exist within an administration protocol manual. While these contextual words will have a lower correlation to a given topic, they are nonetheless paramount for an effective classification model. This leads to a high significance of vector orientation within a VSM, as it is crucial to keep track of how a word represented by a certain dimension relates to words represented by different dimensions.

The high interdependence between VSMs and orientation allows one to assess document similarity solely from the context of vector angles. For example, to rank similarity within a category, a simple and popular mechanism is to calculate the Relevance Status Value, which computes the cosine of the angle between the query and each document in the collection (Rao and Gudivada, 2018). The larger the cosine value, the smaller the angle, and the more similar the documents being compared are. It is important to note at this point that while vector magnitude would typically be a crucial metric to consider as well, Rao et al. further state that VSM vectors are typically normalized before further computation and analysis is done. This means that documents of the same topic will have smaller angles between each other than documents on different topics altogether. Extrapolating from this observation to the comparison of a document to an entire corpus, we expect vectors corresponding to the same topic to be close to the center of the distribution of corpus vectors in order to have a low angle to all vectors in the corpus.
Similarly, we expect vectors from a different topic to have a high angle from the vectors in the corpus. Figure 1 provides an illustration of the expected phenomenon.

Figure 1: New document vector of the same topic versus new document vector of a different topic. Green refers to a document that will be classified as having the same topic, red will be classified as not having the same topic.

Note we do not yet consider documents that are edge case scenarios. To simplify nomenclature for further discussion, we refer to vectors within our corpus that are most dissimilar to the other vectors in the corpus as our fringe vectors. We consider fringe vectors to be as distant from the corpus as possible while still being considered as having the same topic. Thus, as shown in Figure 2, similarity with respect to a fringe vector is not sufficient for a vector to be classified as having the same topic as the given corpus; if a vector is similar to a fringe vector, but less similar to the rest of the corpus than the fringe vector, we will consider the vector being evaluated to be of a different topic. In other words, a vector must be in-between our fringe vectors across all dimensions to be considered as having the same topic as our corpus.

From here, we can translate the classification problem into a linear combination problem. As shown in Figure 3 for the two dimensional case, any vector found in between two vectors can be represented by their linear combination. We define a vector as being in-between two vectors if the sum of its angles to each vector is equal to the angle between the two vectors themselves and it lies on the plane defined by the two vectors. Note this vector can always be calculated as a linear combination of its surrounding vectors; Algorithm 1 shows an approach based on binary search that allows one to identify the scalar combinations needed to recreate the target vector.
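As a concrete illustration of the binary search idea, the following Python sketch recovers non-negative scalars for a target that lies in-between two vectors. It is a sketch under stated assumptions rather than a transcription of Algorithm 1: the function names are hypothetical, the search stops on a cosine-similarity tolerance instead of exact equality, the side of the midpoint is chosen by comparing angles measured from x, and the result is rescaled at the end so that the combination matches the target's magnitude as well as its direction. It also assumes the target really is in-between x and y and that x and y are not opposite vectors.

```python
import math


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(c * c for c in v))


def cos_sim(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    return sum(p * q for p, q in zip(a, b)) / (norm(a) * norm(b))


def find_lambdas(x, y, target, tol=1e-12, max_levels=60):
    """Binary search for (lambda_x, lambda_y) with
    lambda_x * x + lambda_y * y approximately equal to target."""
    lo, hi = 0.0, 1.0  # search range for the weight placed on x
    for _ in range(max_levels):
        lam_x = (lo + hi) / 2.0
        mid = [lam_x * a + (1.0 - lam_x) * b for a, b in zip(x, y)]
        if 1.0 - cos_sim(mid, target) <= tol:
            break  # mid already points in the target's direction
        # The target lies between x and mid exactly when its angle to x
        # is no larger than mid's angle to x: put more weight on x.
        if cos_sim(x, target) >= cos_sim(x, mid):
            lo = lam_x
        else:
            hi = lam_x
    # Rescale so the combination also matches the target's length.
    scale = norm(target) / norm(mid)
    return lam_x * scale, (1.0 - lam_x) * scale


if __name__ == "__main__":
    lam_x, lam_y = find_lambdas([1.0, 0.0], [0.0, 1.0], [0.6, 0.8])
    print(lam_x, lam_y)
```

Both returned scalars are non-negative by construction, which matches the requirement that only positive combinations of corpus vectors are meaningful for topic classification.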
In Algorithm 1, cossim refers to cosine similarity (Sitikhu et al., 2019), target is the vector we are trying to recreate, $x$ and $y$ are the vectors target is in-between, while $\lambda_x$ and $\lambda_y$ are the scalar values such that $\lambda_x x + \lambda_y y = \text{target}$.

Algorithm 1
    Result: λx, λy
    vectorone = x; vectortwo = y;
    mid = (vectorone + vectortwo) / 2;
    λx = 1/2; λy = 1/2; level = 1;
    while mid ≠ target do
        level = level + 1;
        simone = cossim(vectorone, target);
        simtwo = cossim(vectortwo, target);
        if simone ≥ simtwo then
            vectortwo = mid;
            λx = λx + 2^(−level); λy = λy − 2^(−level);
        else
            vectorone = mid;
            λx = λx − 2^(−level); λy = λy + 2^(−level);
        end
        mid = (vectorone + vectortwo) / 2;
    end

This conclusion also makes intuitive sense. As discussed earlier, we can identify a document as being from a particular topic if it has word combinations that indicate as such. A vector that is a linear combination of those within the corpus must have one or more such identifying word combinations as a result. It is important to note that by linear combinations, we specifically refer to the set of positive linear combinations. As mentioned earlier, orientation of vectors is crucial in regards to which documents and word combinations they represent. A negatively scaled vector represents the complete opposite of its positively scaled counterpart and thus should not be used for topic classification.

We have shown it is enough to compose a vector as a positive linear combination of the vectors in a corpus to confirm that it is regarding a similar topic. In other words, a document has the same topic as a corpus if its vector representation is within the positive span of the corpus. The positive span of vectors $v_1$ through $v_k \in \mathbb{R}^n$ is the linear combination $\sum_{i=1}^{k} \lambda_i v_i$ where $\lambda_i \geq 0$ for all $i = 1, \ldots, k$ (Davis, 1954). Note that the original definition made by Davis allows for the zero vector to be included within the positive span.
However, the zero vector within the VSM context represents a vector with none of the terms corresponding to the corpus topic; we thus wish to exclude the zero vector from our span in order to properly classify documents based on their topic. Our new span, coined the conical span, of vectors $v_1$ through $v_k \in \mathbb{R}^n$ is the linear combination $\sum_{i=1}^{k} \lambda_i v_i$ where $\sum_{i=1}^{k} \lambda_i > 0$ for all $i = 1, \ldots, k$. We define the conical set that can be defined via the conical span of a finite number of vectors in Equation 3. Conical span enables a large range of possibilities from the positive span of vectors; Figure 4 showcases the vast representational power in three dimensional space, where the addition of extra vectors dramatically increases the variety of subspace shapes that can be created (Stappen, 2020).

At this point, we have shown that it is sufficient for topic classification that a vector is within the conical span, and we have displayed the expressive power of the conical span. We will now go over an efficient mechanism to determine if a vector is within the conical span. Following Rao et al., the vectors in our corpus as well as our evaluation vectors are unit norm in length (Rao and Gudivada, 2018). In order to train our CC system, we simply find the largest value and the smallest value for every dimension in our corpus vectors, and store them within two vectors for later analysis. Then, when it comes to predicting with CC, we simply need to compare our evaluation vector with both vectors in order to determine if the vector belongs to our corpus. We prove this claim via the following lemmas and theorems.

Lemma 4.1. There is no unit vector within the conical span that is larger in one or more dimensions than the max vector or smaller in one or more dimensions than the min vector.

Proof. We prove by contradiction. Assume there is a vector in the conical span whose value in one or more dimensions is larger than the max vector or smaller than the min vector.
The values in the max and min vectors are set by the fringe vectors for the given dimensions, due to these vectors having the largest acceptable deviation from the corpus. For a vector to have a value outside of this range, the given vector must deviate further from the rest of the corpus than our fringe vectors. This leads to a contradiction; by definition, any vector less similar to our corpus than the fringe vectors must not be classified as being of the same topic and thus is not within the conical span.

Lemma 4.2. There is no unit vector outside the conical span that is smaller in a given dimension than the max vector and larger in a given dimension than the min vector.

Proof. We prove by contradiction. Assume that a unit length vector outside the conical span exists such that its values are in between the min and max vectors. As mentioned in Lemma 4.1, the max and min vectors are defined by the value of our fringe vectors for each dimension of our VSM. A value closer in similarity to the rest of our corpus is by definition within our conical span. This leads to a contradiction: a vector cannot both be more similar to the main vectors within our corpus than our fringe vectors and be classified as a different topic.

Theorem 4.3. All possible unit vectors in the conical span can be represented by a max and min vector.

Proof. By combining Lemmas 4.1 and 4.2, we arrive at the conclusion that Theorem 4.3 is indeed correct.

This result enables us to rapidly train and determine if a given vector is of a certain topic or not. If a vector is not the zero vector, is less than the max vector across all dimensions, and is greater than the min vector across all dimensions, then we classify the vector as being of the same topic, as it is within the conical span of the topic training corpus.
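The training and prediction steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the class and method names are hypothetical, vectors are plain lists of floats, inputs are normalized to unit length inside the class, and the bounds are treated as inclusive (an assumption made here so that training vectors themselves are accepted).

```python
import math


def to_unit(v):
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]


class ConicalClassifier:
    """Per-dimension min/max check over unit-normalized corpus vectors."""

    def fit(self, corpus):
        # Training: record the smallest and largest value per dimension.
        vectors = [to_unit(v) for v in corpus]
        self.min_ = [min(col) for col in zip(*vectors)]
        self.max_ = [max(col) for col in zip(*vectors)]
        return self

    def predict(self, v):
        # The zero vector contains no topic terms and is always rejected.
        if not any(v):
            return False
        u = to_unit(v)
        for value, lo, hi in zip(u, self.min_, self.max_):
            if value < lo or value > hi:
                return False  # short-circuit on the first violation
        return True


if __name__ == "__main__":
    cc = ConicalClassifier().fit([[3, 1, 0], [2, 2, 0], [1, 3, 0]])
    print(cc.predict([2, 1, 0]))  # uses only the first two dimensions
    print(cc.predict([0, 0, 1]))  # uses a dimension absent from the corpus
```

Training is a single pass over the corpus, and prediction touches each dimension at most once, stopping at the first dimension that falls outside the stored range.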
As detailed in the Related Works section, OCSVMs have an extremely high adoption rate within the space; thus, for our analysis, we evaluate the performance of OCSVMs on the following kernel functions: linear, sigmoid, radial basis function (RBF), and polynomial (Poly). To represent our set of ensembles, we train a One Class Random Forest classifier (OCRF) using Goix et al.'s splitting method (Goix et al., 2017) as well as an Isolation Forest classifier (IsolFor) (Liu et al., 2008). For both methods, we use 1000 estimators. We also utilize PNB as a baseline measure. Since we wish to evaluate its performance in the one-class classification regime, we use the evaluation data itself as the unlabeled set of data for training the algorithm; this allows us to only pass in the positive set of data points during training, as is the case for traditional one-class classification algorithms. Finally, to represent embedding based models, we use CVDD as our representation for context preserving embedding based approaches as well as for neural NLP, taking advantage of the official implementation known as CVDD-PyTorch (Ruff et al., 2019). Both GloVe and BERT models are assessed for evaluation purposes, with embedding size, attention size, and number of attention heads set to the best performing configuration. Except for our CVDD baselines, all of our baseline models use TF-IDF as the VSM of choice.

Our intent is to evaluate our baselines as well as CC in scenarios that can require high performance. As mentioned in the Introduction, one main place where this can occur is in insider threat detection. The gold standard dataset for insider threats is the CERT Insider Threat dataset, the largest public repository of insider threat scenarios, compiled after analyzing 1,154 actual insider incidents (Glasser and Lindauer, 2013).
Within this dataset, there are three key website topics that are crucial to detect: Keylogger websites, Wikileaks-like websites, and job posting sites. We extract the text related to both Keylogging and Wikileaks by hand-labeling the text content within version 4.2 in order to use them both for evaluation purposes. For the purpose of evaluating the last of the three, we extract text related information from Vidros et al.'s Fake JobPosting Prediction dataset (Vidros et al., 2017) and from PromptCloud's job dataset (PromptCloud, 2017). Both are high quality datasets listing full descriptions of a large variety of jobs, and versions of both datasets have been used by a plethora of publications (Balachander and Moh, 2018; Kim et al., 2019b; Alghamdi et al., 2019; Mahbub and Pardede, 2018; Reddy et al., 2018). For our purposes, we extract text data from the real job postings in Vidros et al.'s dataset.

We also desire our evaluation set to have exposure to e-commerce applications, medical record information, and fake news articles. For our ECommerce dataset, we utilize the Women's Clothing E-Commerce dataset (Agarap and Grafilon, 2018), which has seen popularity for sentiment analysis and text classification tasks (Sun et al., 2019; Lin, 2020; Kousta and Bellet; Cascaro et al., 2019). Our MedicalTranscription dataset consists of text extracted from the Collection of Transcribed Medical Transcription Sample Reports and Examples (MTSamples), a dataset of interest in academia both from a natural language processing perspective as well as from a medical assessment point of view (Beattie and Richards, 1994; Moramarco et al., 2021; Zuccon et al., 2014). Finally, for FakeNews, we utilize the Fake and real news dataset (Ahmed et al., 2018, 2017); this dataset is especially relevant due to recent increases in the proliferation and rapid diffusion of fake news on the Internet.
We chose this set of classification markers not only due to its representation of some of the fields where we expect CC to be applicable, but also due to the high variability in text length and composition; our Wikileaks and Keylogger datasets are short texts composed primarily of keywords, whereas the MoviePlot and MedicalTranscription datasets have relatively verbose text covering complex and protean topic ranges. This large variety is crucial, as research has shown that text length and topic variations have a dramatic effect on text-based classification performance (Wang and Manning, 2012).

When a given dataset is being evaluated as the positive class, the rest of the datasets are combined and treated as the negative class. Since our training set does not require any data from the negative class, we split the negative class via a 50%-50% split between our validation and test sets. Our positive class is split using a 70%-15%-15% split between our training set, our validation set, and our test set. Resplitting our train and test sets each run, we compile evaluation metrics accuracy, balanced ...

In order to be able to compare compute times, all models are run on the same free instance of Google Colaboratory (Google, 2019). Our evaluation instance had a single core running at 2.00 GHz and had access to 13 GB of RAM. Finally, we discuss the various VSM models used, comparing the baseline VSMs with NE-TF.

Performance metrics can be found in Table 1. CC outperforms baseline models in most scenarios, being the only model with mean accuracies consistently above 95%, balanced accuracies above 94%, and precision, recall, and F1 scores above 93%. PNB had the largest variability out of all algorithms both on a dataset level as well as on a per run level, showcasing how dependent it is on the exact distribution of words that exist within the unlabeled set.
OCRF is one of the best performing baseline models, while IsolFor performed the worst, clearly showing that the splitting algorithm used to determine tree structure is crucial for topic determination with ensemble models. The Linear OCSVM outperformed the other OCSVM alternatives. The performance delta between BERT and GloVe does not justify the additional computation costs involved with using a BERT encoding for our problem space. Both neural NLP models are consistently outperformed by CC across datasets for one-class topic determination. While CVDD and other neural NLP algorithms that use embeddings have use cases in one-class topic determination where they work well, they perform worse when the positive class is highly manifold in nature, as is the case for the Jobs and MoviePlots datasets.

Where CC truly shines is in computational efficiency, showcased in the scenarios with high text complexity. Since we compare each evaluation vector to the max and min vectors, CC has a worst-case runtime of $O(dn)$, where $d$ is the vector dimension and $n$ is the number of vectors to be evaluated. In practice, however, the efficiency is much greater, as we can short-circuit computation as soon as we find a discrepancy; this is a benefit that none of the baselines have. When we compare this to ensembles with a runtime of $O(dn \log n)$, kernel OCSVMs with a runtime of $O(n_{support} \cdot dn)$ where $n_{support}$ is the number of support vectors, PNB which has a runtime of $O(dn + 4d)$ due to performing training and evaluation at the same time, and neural NLP solutions having a forward pass complexity of $O(n^4)$ (Fredenslund, 2018), the efficiency of CC is clear. Linear OCSVM has the highest computational efficiency out of the baselines, with the same worst-case runtime as CC of $O(dn)$. However, for each vector at each dimension, Linear OCSVM performs two operations, a multiplication as well as an addition, compared to only one for CC.
Additionally, Linear OCSVM has no short-circuit capability, so it will always take the maximal amount of time to compute. This can be seen in our results, where CC outperforms Linear OCSVM in computation time, especially on the more complex datasets like MedicalTranscriptions and MoviePlots where the time differences are stark.

We identified that the encoding and embedding process is the foremost reason behind the long computation times both versions of CVDD have. This is the reason behind the development of NE-TF; being a bag-of-words VSM, it boasts great speed in creating its vector representations. Additionally, bag-of-words VSM models like NE-TF also provide benefits in terms of memory footprint; for our datasets, SpacyEncoding requires 154.7MB, BertTokenizer requires 157.1MB, while NE-TF requires only 18.3MB, leading to a roughly 9 times smaller footprint. When we compare to alternative bag-of-words VSM models, NE-TF has a comparable memory footprint but is faster to compute; the statistical significance weighting mitigates the need for stop word pruning, further improving performance.

We show that Conical Classification is a computationally efficient method of one-class topic classification that aims to identify whether a vector is within the conical span of the training corpus for a given topic. When combined with Normal Exclusion, Conical Classification showcases the optimal combination of predictive power, consistently great results, and fast computation times.
Statistical analysis on e-commerce reviews, with sentiment classification using bidirectional recurrent neural network (rnn)
Detection of online fake news using n-gram analysis and machine learning techniques
Detecting opinion spams and fake news using text classification
An intelligent model for online recruitment fraud detection
Detection of text quality flaws as a one-class classification problem
Ontology based similarity for information technology skills
Separation of metallothionein isoforms by micellar electrokinetic capillary chromatography
Aggregating filter feature selection methods to enhance multiclass text classification
Scenario-based insider threat detection from cyber activities
Theory of positive linear dependence
Text classification and cotraining from positive and unlabeled examples
Bert: Pre-training of deep bidirectional transformers for language understanding
A comparison of term weighting schemes for text classification and sentiment analysis with a supervised variant of tf.idf
Bns feature scaling: an improved representation over tf-idf for svm text classification
An extensive empirical study of feature selection metrics for text classification
Computational complexity of neural networks
Bridging the gap: A pragmatic approach to generating insider threat data
One class splitting criteria for random forests
Google colaboratory
Automated document classification for news article in bahasa indonesia based on term frequency inverse document frequency (tf-idf) approach
One-class classification by combining density and class probability estimation
One-class text classification with multi-modal deep support vector data description
Expert guided natural language processing using one-class classification
A statistical interpretation of term specificity and its application in retrieval
Multi-co-training for document classification using various document representations: Tf-idf, lda, and doc2vec
Fraud detection for job placement using hierarchical clusters-based deep neural networks
Research paper classification systems based on tf-idf and lda schemes. Human-centric Computing and Information
Local interpretable model-agnostic explanations for long short-term memory network used for classification of amazon customer reviews
Benchmarking evolutionary computation approaches to insider threat detection
Machine learning based insider threat modelling and detection
2021. Machine learning based on natural language processing to detect cardiac failure in clinical narratives
Md Mostofa Ali Patwary, Ankit Agrawal, and Alok Choudhary
Sentiment analysis of e-commerce customer reviews based on natural language processing
Research of text classification based on improved tf-idf algorithm
Isolation forest
Using contextual features for online recruitment fraud detection
One-class svms for document classification
Delta tfidf: An improved feature space for sentiment analysis
Deep learning based attribute classification insider threat detection for data security
Towards objectively evaluating the quality of generated medical summaries
Collection of transcribed medical transcription sample reports and examples
The oec: Facts about the language. Oxford English Dictionary
Scikit-learn: Machine learning in Python
Glove: Global vectors for word representation
Us jobs kaggle dataset
Computational analysis and understanding of natural languages: Principles, methods and applications
Analysis of e-recruitment systems and detecting e-recruitment fraud
Self-attentive, multi-context one-class classification for unsupervised anomaly detection on text
Image-based feature representation for insider threat classification
An application of one-class support vector machines in content-based image retrieval
A comparison of semantic similarity methods for maximum human interpretability
Motion and manipulation lecture series
Sentiment analysis of commodity reviews based on multilayer lstm network
Knn with tf-idf based framework for text categorization
Deep learning for unsupervised insider threat detection in structured cybersecurity data streams
Automatic detection of online recruitment frauds: Characteristics, methods, and a public dataset
Baselines and bigrams: Simple, good sentiment and topic classification
Insider threat prediction based on unsupervised anomaly detection scheme for proactive forensic investigation
Prediction of user consumption behavior data based on the combined model of TF-IDF and logistic regression
A comparative study of tf*idf, lsi and multi-words for text classification
Parameter estimation of one-class svm on imbalance text classification
De-identification of health records using anonym: Effectiveness and robustness across datasets

[Results table fragment; the column headers and the first row's label are missing]
(unlabeled)  1.000 ± 0.000  1.000 ± 0.000  1.000 ± 0.000  1.000 ± 0.000  1.000 ± 0.000  253.909 ± 28.636
BERT CVDD    1.000 ± 0.000  1.000 ± 0.000  1.000 ± 0.000  1.000 ± 0.000  1.000 ± 0.000  298.142 ± 38.494
CC           1.000 ± 0.000  1.000 ± 0.000  1.000 ± 0.000  1.000 ± 0.000  1.000 ± 0.000  11.163 ± 0.014