Divakaran, Dinil Mon; Oest, Adam. Phishing Detection Leveraging Machine Learning and Deep Learning: A Review. 2022-05-16.

Phishing attacks trick victims into disclosing sensitive information. To counter rapidly evolving attacks, we must explore machine learning and deep learning models leveraging large-scale data. We discuss models built on different kinds of data, along with their advantages and disadvantages, and present multiple deployment options to detect phishing attacks.

Phishing attacks attempt to deceive users with the goal of stealing sensitive information such as user credentials (username and password), bank account details, and even entire identities. To achieve this, an attacker deploys a malicious website that resembles the legitimate website of a well-known target brand; e.g., Figure 1 shows a real phishing page targeting Facebook. Subsequently, the attacker sends a link to the phishing website to potential victims and leverages social engineering techniques to lure those victims into disclosing confidential information. Figure 2 illustrates the general process, which can be quite complicated in practice. A recent phishing campaign used a compromised email account to send victim users a malicious attachment containing obfuscated code. This code, in turn, retrieved dynamic scripts from a hosting server that ultimately generated HTML to render a phishing page targeting Microsoft Office 365 users (for a detailed description, see [1]). Indeed, phishing attacks have evolved to increasingly use shortened URLs, redirection links, the HTTPS protocol, server- and client-side cloaking techniques (displaying benign pages to anti-phishing crawlers), etc., to evade detection [2], [3].
[Figure 2 caption: 2) The attacker sends the corresponding link to numerous potential victims via emails, social networks, etc. 3) A user who gets deceived accesses the phishing webpage and provides sensitive information. 4) The user credentials are retrieved by the attacker, 5) to access the targeted website to commit fraud.]

To add to these, today there exist software toolkits, or phishing kits, in dark markets that automate the majority of the aforementioned process. Thus, it is no surprise that phishing continues to be one of the top cyber attacks today. Furthermore, the COVID-19 pandemic and the prevalence of remote work have given rise to new social engineering and victimization opportunities for attackers [4].

A well-known technique for protecting users from accessing phishing webpages is to use an anti-phishing blacklist. Webpage URLs accessed from a modern web browser (e.g., Google Chrome, Mozilla Firefox, or Microsoft Edge) are automatically checked against a list of known phishing URLs, and access to a matched URL triggers a warning or is blocked. The URLs that ultimately end up on anti-phishing blacklists are collected, crawled, and analyzed by anti-phishing entities (e.g., Google Safe Browsing, the Anti-Phishing Working Group, or PhishTank), each of which leverages a variety of threat intelligence sources and cross-organizational partnerships. Blacklists are efficient and scale well with the number of phishing webpages in the wild. A URL lookup can easily be done in real time, using the user's computer resources, and such lookups also preserve user privacy. However, anti-phishing blacklists are not themselves capable of detecting new and unseen phishing pages: they rely on crawlers that must analyze the webpage content beforehand. Such crawlers are, in turn, susceptible to evasion attacks [5] that make phishing webpages appear benign whenever they are accessed by a crawler.
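A client-side blacklist lookup of the kind described above can be sketched as a local hash-prefix check, loosely in the style of Safe-Browsing-style lists. The URLs, prefix length, and function names below are illustrative assumptions, not any vendor's actual implementation:

```python
import hashlib

# Illustrative blacklist of known phishing URLs (hypothetical entries).
PHISHING_URLS = {
    "http://examp1e-login.top/secure/verify.html",
    "http://free-gift.example.net/claim",
}

def url_prefix(url: str, n: int = 4) -> bytes:
    """Return an n-byte SHA-256 prefix of the URL, so the client can keep
    a compact list rather than the full plaintext blacklist."""
    return hashlib.sha256(url.encode()).digest()[:n]

BLACKLIST_PREFIXES = {url_prefix(u) for u in PHISHING_URLS}

def is_blacklisted(url: str) -> bool:
    """Local O(1) lookup against stored prefixes. A real deployment would
    confirm a prefix hit with the list provider (prefixes can collide)
    before warning or blocking the user."""
    return url_prefix(url) in BLACKLIST_PREFIXES
```

Because only hash prefixes leave the device in such designs, the lookup stays fast and privacy-preserving, matching the advantages noted above.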
Blacklist-based approaches also inherently struggle when attackers use redirection links as lures, as these benign-looking links differ from the URL of the corresponding phishing webpage. These limitations of large-scale anti-phishing systems have motivated researchers to explore learning algorithms to develop phishing detection solutions capable of running in multiple different contexts that can keep up with the evasion efforts of attackers.

A supervised machine learning (ML) algorithm takes a large labeled dataset as input to train a classification model that subsequently classifies an input data point into a given number of classes. Figure 3 presents an ML pipeline for developing supervised models that detect phishing attacks. In a phishing webpage detection problem, there are only two classes, benign and phishing; hence, the trained models are binary classifiers. Each data point (e.g., a URL) in the input dataset is accompanied by a ground-truth label of benign or phishing, to help the model learn the discriminative characteristics of the two classes. Let M denote such a trained model. In operation (also called the inference phase), the trained model M is given a webpage URL, and it computes the probability p that the webpage is a phishing webpage. Given a configured threshold θ, the input (URL) is classified as phishing if the prediction probability is greater than or equal to this threshold, i.e., if p ≥ θ; and benign otherwise. The accuracy of a model depends on a few factors, among the most important being the model algorithm, the quality and quantity of the data used for training, and the value of the classification threshold used in operation. A phishing webpage correctly classified is a True Positive (TP), and a benign webpage correctly classified is a True Negative (TN). Conversely, a phishing webpage wrongly classified (as a benign page) is a False Negative (FN), and a benign webpage misclassified as phishing is a False Positive (FP).
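The thresholding and tallying just described can be sketched in a few lines; the function and variable names are our own, not from any particular system (label 1 denotes phishing):

```python
def confusion_counts(scores, labels, theta):
    """Apply threshold theta to model probabilities and tally the four
    outcomes (TP, FP, TN, FN) against ground-truth labels (1 = phishing,
    0 = benign), exactly as defined above: predict phishing iff p >= theta."""
    tp = fp = tn = fn = 0
    for p, y in zip(scores, labels):
        pred = 1 if p >= theta else 0
        if pred == 1 and y == 1:
            tp += 1          # phishing correctly flagged
        elif pred == 1 and y == 0:
            fp += 1          # benign page wrongly flagged
        elif pred == 0 and y == 0:
            tn += 1          # benign page correctly passed
        else:
            fn += 1          # phishing page missed
    return tp, fp, tn, fn
```

For example, with scores [0.9, 0.2, 0.7, 0.4], labels [1, 0, 0, 1], and θ = 0.5, the 0.7-scored benign page becomes a false positive and the 0.4-scored phishing page a false negative, giving (1, 1, 1, 1).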
The most important metrics for evaluating phishing detection models are i) the True-Positive Rate (TPR), also called Recall, and ii) the False-Positive Rate (FPR). The Recall of a model is defined as TP/(TP+FN), and the FPR is defined as FP/(FP+TN). The performance of a classification model is a trade-off between Recall and FPR; e.g., 100% Recall can be achieved by classifying every webpage as phishing (i.e., when θ = 0), but that would also result in the highest FPR for that model. Therefore, a good evaluation strategy is to measure a model's Recall at varying values of FPR. An FPR of 10^-1 means that, on average, 1 in 10 benign webpages visited by a user is misclassified as phishing. If users are accordingly warned or blocked, then an FPR of 10^-1 is clearly unacceptable due to the high number of disruptions caused to legitimate browsing. Therefore, in practical deployments, phishing detection solutions need to evaluate Recall at FPR values of 10^-3 and even lower [6], [7].

Below we discuss different machine learning techniques for phishing detection. We distinguish the solutions based on the input they process for training and prediction (URLs, HTML contents, and webpage screenshots), as this leads to different deployment use cases. Table 1 provides a taxonomy of such detection techniques.

A phishing URL classifier trains a classification model on a dataset of benign and phishing URL strings along with their ground-truth labels (i.e., in a supervised way). Researchers often use top-ranking websites from Alexa (https://www.alexa.com/topsites) or Tranco (https://tranco-list.eu) to build a dataset of benign URLs, whereas the set of phishing URLs is usually obtained from PhishTank (https://phishtank.org/) and OpenPhish (https://openphish.com/). Once a model is trained, in operation it is given a URL obtained from a user (e.g., from a user's browser or email, or from a middlebox such as a web proxy) to predict whether it links to a phishing page or not.
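The evaluation strategy above (Recall measured at a fixed FPR budget) can be sketched as a threshold sweep; this is a minimal illustration with our own function names, assuming both classes are present in the evaluation set:

```python
def recall_and_fpr(scores, labels, theta):
    """Compute Recall = TP/(TP+FN) and FPR = FP/(FP+TN) at threshold theta
    (labels: 1 = phishing, 0 = benign; assumes both classes occur)."""
    tp = sum(1 for p, y in zip(scores, labels) if p >= theta and y == 1)
    fn = sum(1 for p, y in zip(scores, labels) if p < theta and y == 1)
    fp = sum(1 for p, y in zip(scores, labels) if p >= theta and y == 0)
    tn = sum(1 for p, y in zip(scores, labels) if p < theta and y == 0)
    return tp / (tp + fn), fp / (fp + tn)

def recall_at_fpr(scores, labels, max_fpr):
    """Sweep candidate thresholds and report the best Recall whose FPR
    stays at or below max_fpr -- how an operating point such as
    'Recall at FPR = 10^-3' is read off in practice."""
    best = 0.0
    for theta in sorted(set(scores)):
        recall, fpr = recall_and_fpr(scores, labels, theta)
        if fpr <= max_fpr:
            best = max(best, recall)
    return best
```

On a real evaluation set the sweep traces out the model's ROC curve; here the tiny score lists only demonstrate the mechanics.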
Note that, during both the training and prediction phases, these models process only the URLs to extract useful information called features; they do not visit the given URLs. Thus, there is no additional cost of a webpage visit, network latency, or page-load time for a URL-based ML solution. A URL can be readily processed, and a prediction can be made before the corresponding webpage is visited.

[Table 1 fragment: cost of network latency to fully load a webpage and take a screenshot; prone to adversarial ML attacks.]

Early classification solutions engineered features from URLs to train machine learning models (Table 1, row 1). The features try to capture the lexical properties of the URLs; they include the length of the URL, the length of the domain part of the URL, the length of the path in the URL, the number of permitted non-alphanumeric characters (such as dot and hyphen) encoded in a URL, the number of vowels and consonants, etc. In addition, more complicated features are engineered by learning a vocabulary of words (or tokens) and representing the presence of each word in the URL (separately for different components of the URL) as a long binary vector [8]. This results in long feature vectors, which affects the performance of some conventional ML models, such as the SVM (support vector machine). There are also external features, such as IP and domain reputation, domain registration information from the WHOIS database, etc., that are useful for modeling. However, extraction of these external features requires additional network latency or the maintenance of up-to-date large databases. A drawback here is that one has to manually define and engineer discriminative features, and the large body of research works in this direction is evidence that defining all relevant high-quality features exhaustively is a challenging task. Furthermore, since the phishing attack ecosystem keeps evolving, the list of features needs to be continuously updated.
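A small sample of the hand-engineered lexical features listed above can be extracted with the standard library alone; the feature names are illustrative, and real systems compute many more:

```python
from urllib.parse import urlparse

def lexical_features(url: str) -> dict:
    """Extract a few of the lexical URL features described in the text:
    lengths of the URL, domain, and path, plus simple character counts.
    No network access is needed -- the URL string itself is the input."""
    parsed = urlparse(url)
    return {
        "url_len": len(url),
        "domain_len": len(parsed.netloc),
        "path_len": len(parsed.path),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_digits": sum(c.isdigit() for c in url),
        "num_vowels": sum(c in "aeiou" for c in url.lower()),
    }
```

For instance, for the (hypothetical) lure "http://paypal-login.example.com/verify/account.html", the hyphenated brand-lookalike domain contributes to `domain_len` and `num_hyphens`, the kind of signal a conventional classifier such as an SVM or random forest would consume as a feature vector.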
It is likely that, at any point in time, the list of features used in a model is incomplete.

Thus, more recent research works apply deep learning (DL) models to detect phishing URLs. The Convolutional Neural Network (CNN) is a state-of-the-art deep learning model that has often been used for visual analysis, while also being applied to text classification. For building a DL-based phishing URL classifier (Table 1, row 2), we first transform the input to a vector of numbers, referred to as an embedding. Each character in a URL is represented by a vector of fixed dimension, and thus each URL is represented as a matrix, with each row representing a character. Since the matrix dimension is also fixed, the number of characters processed in a URL is pre-determined; i.e., the length of URLs processed is limited. A CNN model essentially performs convolution operations on the input at its different nodes, to learn the differentiating patterns between phishing and benign URLs in the labeled dataset; this model has been shown to be effective in detecting phishing URLs [9]. However, character-level models might not be able to capture relationships between characters, and more importantly words, that are far apart in a URL. This is where a sequence model, often used in language modeling, is useful. The LSTM (long short-term memory) architecture, a widely used sequence model, learns dependencies between characters in a long sequence (or string, as in a URL). The basic unit of an LSTM has multiple gates that allow it to learn (and forget) information from arbitrary points in the sequence. Lee et al. [7, Section 6.1] evaluated the performance of CNN, LSTM, and a combined architecture of CNN and LSTM on a large dataset of URLs obtained from enterprise customers. At a low FPR of 10^-3, the combined CNN-LSTM model performed significantly better (achieving a Recall of ≈ 76%) than the independent models (which achieved ≈ 58% Recall) in detecting phishing URLs.
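The fixed-length character encoding that feeds such CNN/LSTM models can be sketched as follows; the vocabulary, maximum length, and index conventions are illustrative assumptions, not those of any specific published model:

```python
# Hypothetical character vocabulary for URLs; index 0 is reserved for
# padding and index 1 for characters outside the vocabulary.
VOCAB = "abcdefghijklmnopqrstuvwxyz0123456789-._~:/?#@!$&'()*+,;=%"
CHAR_TO_IDX = {c: i + 2 for i, c in enumerate(VOCAB)}
MAX_LEN = 200  # fixed input length assumed for the model

def encode_url(url: str) -> list:
    """Map each character to an integer index, truncating or zero-padding
    to MAX_LEN. A downstream embedding layer turns each index into a dense
    vector, yielding the fixed-size matrix (one row per character) that
    the CNN/LSTM consumes; URLs longer than MAX_LEN are simply cut off."""
    idxs = [CHAR_TO_IDX.get(c, 1) for c in url.lower()[:MAX_LEN]]
    return idxs + [0] * (MAX_LEN - len(idxs))
```

The truncation at `MAX_LEN` makes concrete the limitation noted above: only a fixed, pre-determined number of characters of each URL is ever seen by the model.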
In this combined model, the output of the CNN is fed to the LSTM, and the output of the LSTM is used for prediction. The architecture consists of an embedding layer, followed by 256 Conv1D filters, a pooling layer (using MaxPooling), 512 LSTM units, a Dropout layer, a Dense layer, and finally a single unit for classification (using the sigmoid activation function).

The information available to train a classification model is limited to what is available in a URL. An attacker today has many options when deciding on a domain name, the prominent part of a URL; and the rest of the URL (following the domain name), i.e., the path of a file under the website, is completely under the attacker's control. This makes it easy for an attacker to evade a URL-based detection model. Besides, shortened URLs present much less information for a model to make a good prediction. It is therefore not surprising that URLNet [9], a URL-based deep learning model, incurred a large number of false positives in a recent phishing discovery study conducted in the wild (Internet) [6, Section 6]. One way to overcome this limitation is to develop models based on webpage contents and screenshots. We discuss them below.

The content of webpages offers a wealth of information that can be exploited to detect phishing attacks (Table 1, row 3). Since the goal of an attacker is to deceive users into providing their sensitive information, phishing webpages often have HTML forms and other discriminating characteristics. Therefore, features that capture the presence of HTML forms, tags, and sensitive keywords that prompt users to input username, password, credit card numbers, etc., are useful in building an HTML-based phishing classifier [10]. There are also a number of other features useful for modeling; they include the length of the HTML
,