key: cord-0946570-opph9x5p
authors: Mughaid, Ala; AlZu’bi, Shadi; Hnaif, Adnan; Taamneh, Salah; Alnajjar, Asma; Elsoud, Esraa Abu
title: An intelligent cyber security phishing detection system using deep learning techniques
date: 2022-05-14
journal: Cluster Comput
DOI: 10.1007/s10586-022-03604-4
sha: 251258f4c2fd6c0d42419593ad622d3358b0319d
doc_id: 946570
cord_uid: opph9x5p

Phishing attacks have recently become one of the most prominent social engineering attacks faced by public internet users, governments, and businesses. In response to this threat, this paper gives an overview of machine learning, of the techniques phishers use to trick gullible users across different types of phishing attacks, and of the sectors and users that are targeted; our survey indicates that phishing email is the most effective channel against them. More effective detection technology is therefore needed to curb the threat of phishing emails, which has grown at an alarming rate in recent years. We discuss machine learning algorithms and other technical solutions that have been proposed to mitigate phishing, together with the awareness knowledge users need in order to detect phishing scams and avoid being duped by them. We propose a detection model based on machine learning techniques: each dataset is split so that the model is trained on one part and validated on held-out test data, and the model captures inherent characteristics of the email text and other features to classify emails as phishing or non-phishing. The model is evaluated on three different datasets. Comparing them, we found that the more features were used, the more accurate and efficient the results. The best accuracies, obtained with the boosted decision tree, were 0.88, 1.00, and 0.97 on the three datasets, respectively.

Cybercrime refers to crimes that target a computer or a network. Computer crimes consist of a broad range of potentially criminal activities. Phishing is the most commonly used social engineering attack. Through such attacks, the phisher tries to obtain confidential information from the user with the purpose of using it fraudulently against the user [1, 2]. In today's digitized business world, more and more companies are taking advantage of the ever-evolving opportunities of cyberspace. Internet technology has also grown in daily life, especially because of the impact of COVID-19, which forced users in all sectors to rely more on the internet. Phishing is metaphorically similar to fishing in the water, but instead of trying to catch fish, attackers try to steal the user's personal information. Phishing websites look very similar to their corresponding legitimate websites in order to attract large numbers of internet users. Recent developments in phishing detection have led to the growth of many new approaches based on visual similarity.

Machine learning and modern AI techniques have been efficiently employed in several applications of human life [3, 4], and many previous researchers have employed machine learning in security fields, such as in [5] [6] [7] [8] [9]. Computer security attacks are classified into three types: physical attacks, synthetic attacks, and semantic attacks. Phishing is one of the semantic attack types [10]. Such attacks target the vulnerabilities of users, for example the way users interpret computer messages, because most users read information sources without verification and respond to their requests. Phishing is a type of social engineering attack often used to steal user data, which is then used to access important accounts and can result in identity theft and financial loss. It occurs when an attacker, posing as a trusted legitimate institution, dupes a victim through communication channels. The user is then lured into clicking a malicious link, which can lead to the installation of malware, the freezing of the system as part of a ransomware attack, or the disclosure of sensitive information.

Phishers carry out their attacks over several channels. E-mail ("phishing") is the most common channel for phishing and reverse social engineering attacks. Instant messaging ("smishing") is gaining popularity among social engineers as a tool for phishing and reverse social engineering attacks. Telephone and Voice over IP ("vishing") are common attack channels that social engineers use to make their victims deliver sensitive information. These attacks are designed to lead consumers to counterfeit web sites that trick recipients into divulging financial data such as usernames and passwords.

For example, according to PhishTank [11]: "Phishing is a fraudulent attempt, usually made through email, to steal your personal information." This definition restricts phishing to attacks aimed at stealing personal information, which is not always the case. For example, a socially engineered message could lure the victim into installing man-in-the-browser (MITB) malware. The malware then transfers money to the attacker's bank account whenever the victim logs in to perform banking tasks, without any need to steal the victim's personal information. Therefore, we believe that PhishTank's definition is not broad enough to cover the entire problem. Another definition is provided by Whittaker et al. [12]: "We define a phishing page as any web page that, without permission, alleges to act on behalf of a third party with the intention of confusing viewers into performing an action with which the viewer would only trust a true agent of the third party." This definition aims to be broader than PhishTank's in the sense that the attacker's targets are no longer limited to stealing personal information from victims. On the other hand, it still limits phishing attacks to those acting on behalf of third parties, which is not always true. For example, phishing attacks can deliver socially engineered messages that lure victims to websites that are supposed to serve safe content, and from there into installing MITB malware. Once the malware is loaded, it can record keystrokes to steal the victim's passwords. Note that the attacker in this scenario does not claim the identity of any third party in the phishing process, but only transmits messages with links that lure victims to view videos or multimedia content. To address the limitations of the definitions above, we consider phishing attacks to be semantic attacks that use electronic communication channels to convey socially engineered messages to convince the victim to take some action for the attacker's benefit.

According to the APWG phishing attack trends reports [13, 14], the number of phishing attacks observed by APWG and its members grew through 2020, doubling over the course of the year. Phishing is spread via e-mail, SMS, instant messaging, social networking, and other channels, but e-mail is the most popular way to carry out this attack. A phishing email can lead to financial loss. The attacker sends an email that makes users believe they are communicating with a trusted entity and deceives them into providing personal credentials in order to access a service, such as credit card numbers, account login credentials, or identity information. In 2019, 293.6 billion emails were sent and received daily. This includes billions of promotional emails sent by merchants every day. While many email users believe that such content belongs in their spam folder, marketing emails are generally harmless, even if they are a nuisance to users. Spam messages accounted for 47.3 percent of e-mail traffic in September 2020, causing serious economic losses and social problems.

[www.statista.com] Spam e-mail It is almost impossible to think about email without considering the problem of spam. The world's most common variants of malicious spam include Trojan horses, spyware, and ransomware. Many approaches have been developed to deal with the spam problem [15]. Currently, three ways to mitigate such attacks stand out: approaches based on awareness, approaches based on blacklists, and approaches based on machine learning (ML). More recently, Deep Learning (DL) has emerged as one of the most efficient machine learning techniques [16].

In Sect. 1 of this paper, we introduce our application of machine learning to three different datasets, where the first two datasets depend on multiple features and the third one depends on the text feature only. Section 2 reviews the related work on classifiers used to detect phishing emails, and Sect. 3 describes the victims targeted by phishing. The methodology followed in this research is introduced in Sect. 4. Section 5 presents the experiments on classifying phishing emails using machine learning. Finally, the work is concluded in Sect. 6.

The ease of communication that came with the advent of email also caused the problem of unsolicited bulk email, especially phishing attacks via email. Various anti-phishing techniques have been developed to solve the problem of phishing attacks. This paper focuses on separating important emails from spam. One of the main factors in classification is how messages are represented. Specifically, one needs to decide which features to use and how to use those features when categorizing messages. Many researchers have employed AI in intelligent systems, and many of them have used deep learning in cybersecurity applications [17] [18] [19] [20].

Fette et al. proposed an email filtering approach called PILFER, which considers a set of 10 features, including URL-based and script-based features, to detect phishing attacks [21]. By filtering phishing emails before users read them, it can reduce the percentage of users who are defrauded. Phishers can hide the URL and use tools like TinyUrl to make the URL appear valid. Phishers are becoming increasingly sophisticated in their approach and are incorporating strategies to bypass existing anti-phishing tools.

Bhat et al. [22] proposed an approach that derives a spam filter called Beaks. They classify emails into spam and non-spam. Their pre-processing technique is designed to identify tag-of-spam words relevant to the dataset.

Kang Leng et al. proposed an anti-phishing tool that depends on nine features derived from structure-based and behavior-based features. They are the domain name of the sender, the blacklisted words in the subject and the content, the IP address in the URL, the dot in the URL, the symbol in the URL, the unique sender, the unique domain name, the hyperlink non-consistency, and the return path. All recommended features are selected based on the phishing techniques commonly used by phishers, and the approach achieves an accuracy of 97.25% [23] [24] [25].
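To make the URL-related features in this list more concrete, the following is a minimal sketch of how three of them (IP address in the URL, dots in the URL, symbol in the URL) might be computed. The function name, the feature names, and the choice of "@" as the symbol of interest are assumptions made for illustration; they are not the definitions used in [23].

```python
import re
from urllib.parse import urlparse

# Matches a host that is written as a dotted IPv4 address, e.g. 192.168.0.1.
IPV4_HOST = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")

def url_features(url: str) -> dict:
    """Return three simple structural features of a URL (hypothetical names)."""
    # hostname is None when the URL has no scheme; fall back to an empty string.
    host = urlparse(url).hostname or ""
    return {
        "ip_in_url": bool(IPV4_HOST.match(host)),  # raw IP instead of a domain name
        "dot_count": host.count("."),              # many dots can hint at nested subdomains
        "has_at_symbol": "@" in url,               # "@" can hide the real host
    }

print(url_features("http://192.168.0.1/login"))
print(url_features("http://paypal.com@evil.example.net/"))
```

In the second call the host that is actually contacted is evil.example.net, which is why the "@" symbol is often treated as suspicious.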

Teli [26] presented a three-step system for spam detection that classifies each newly arriving email as spam or legitimate according to the given algorithm. Ram Basnet et al. [27] employed machine learning algorithms to detect phishing attacks by classifying emails as phishing or legitimate. They used sixteen features, and their dataset contains 4000 instances with a ratio of 0.75 legitimate emails to 0.25 phishing ones. They used 0.50 of the dataset for training and the remainder for testing. The accuracy they obtained was 97.99%.

Moradpoor et al. [28] used two datasets containing 14,370 emails (benign/phishing) in their neural-network-based model for the detection and classification of phishing emails, reporting an overall accuracy of 92.2%. Smadi et al. [29] proposed a model that detects phishing emails by extracting 23 features; they compared different algorithms, among which random forest achieved the highest accuracy of 98.8%.

In this work, we applied machine learning techniques, to capture inherent characteristics of the email text and other features to be classified as phishing or non-phishing according to the selected data-sets.

In the sections below, we summarize some of the identified characteristics of potential phishing victims based on previous studies:

A. Victim's age: One study performed a role-play survey of demographics and phishing susceptibility. Young people are accountable for how they use their devices, but they should also know how to protect the security of those devices. A study with 83 teenagers found that teenagers were poor at distinguishing between legitimate and phishing messages in an experimental task. Participants exhibited riskier behavior when making decisions on unfamiliar messages. In a cross-sectional study of 350 children aged 6 months to 4 years, most households had television (0.97), tablets (0.83), and smartphones (0.77). At age 4, half of the children had their own television and their own mobile device. Almost all children (96.6%) used mobile devices, and most started using them before the age of one. Parents gave children devices when doing house chores (0.70), to keep them calm (0.65), and at bedtime (0.29). At age 2, most children used a device daily. Most 3- and 4-year-olds used devices without monitoring, and one-third engaged in media multitasking [36]. In addition, children aged 5-15 years go online for a minimum of 8 h every week. Common online activities for children of this age include communicating through social media, watching YouTube videos, and playing games [37]. One of the primary digital risks that children need to be aware of is phishing, a common social engineering attack ranked as one of the most dangerous online risks for children. In 2017, more than 1 million children (below 17 years of age) in the U.S. alone were victims of identity theft, with estimated costs of 2.6 billion dollars. Several efforts have been made towards designing mechanisms and training tools to help protect people against phishing.

Given the prevalence and potential consequences of phishing, continuous efforts are being made to improve the cybersecurity knowledge of citizens and to develop protections against phishing attacks. Researchers have explored technical solutions, awareness through cybersecurity educational games and training materials, and the addition of cues in the user interface to aid in phish detection. As energetic users of social media websites, teens regularly share a great deal of data while interacting and communicating with each other. This sharing may be risky. The number of phishing attacks has expanded through the years. A recent survey of nearly 15,000 end-users from seven countries showed that 0.83 of the respondents had experienced a phishing attack in 2018, compared to 0.76 in 2017. Phishing severely impacts businesses; mid-sized companies pay an average of 1.6 million dollars to recover from a successful phishing attack, where the consequences include malware infections, compromised accounts, and data loss.

Cain et al. conducted a study on people aged 18 to 55 years and observed that younger people have poor cyber security habits related to password management and phishing. Adults have been shown to have poor calibration between confidence and actual performance when it comes to identifying phishing messages, which can increase the likelihood that an attack is successful [38].

Our methodology is divided into the following phases: dataset collection, dataset preprocessing, and the application of machine learning classification techniques. We propose models to classify emails, where each model has been built with different functions based on the three datasets and their different features. The models aim to detect phishing emails with high probability while filtering out as few legitimate emails as possible. The results compare favorably with existing detection methods and support the effectiveness of our models in detecting phishing emails. The experiment phases are presented in detail in a separate subsection. Figure 1 illustrates the structure of the proposed detection model.

The first and most important step is to obtain the required data. We used three datasets taken from publicly available resources. The reason for using three datasets with different features is the high rate at which phishing attack techniques change, which increases the difficulty of detecting and filtering phishing emails; using them also lets us identify how the number of features affects the efficiency of the model in detecting phishing emails. After collecting the data, we preprocessed the datasets by removing duplicated rows, removing missing values, and balancing the instances to achieve the most accurate rate. After importing the processed dataset, we split it into 0.70 for training and 0.30 for testing the model. A ratio of 0.70 training to 0.30 test is a common choice. The idea is that more training data makes the classification model better, while more test data makes the error estimate more accurate.
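The preprocessing and splitting steps described above can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions, not the pipeline used in the experiments: rows are assumed to be dictionaries with a hypothetical "label" key, balancing is done by random undersampling to equal class sizes, and the function names and seeds are chosen for the example only.

```python
import random

def preprocess(rows, label_key="label", seed=0):
    """Drop duplicate and incomplete rows, then undersample to balance the classes."""
    seen, clean = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))            # hashable fingerprint of the row
        if key in seen or any(v is None for v in row.values()):
            continue                                # skip duplicates and missing values
        seen.add(key)
        clean.append(row)
    by_class = {}
    for row in clean:
        by_class.setdefault(row[label_key], []).append(row)
    n = min(len(group) for group in by_class.values())  # size of the smallest class
    rng = random.Random(seed)
    balanced = [r for group in by_class.values() for r in rng.sample(group, n)]
    rng.shuffle(balanced)
    return balanced

def train_test_split(rows, train_fraction=0.70, seed=0):
    """Shuffle the rows and cut them into a training part and a test part."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

emails = [{"text": f"mail {i}", "label": i % 2} for i in range(10)]
train, test = train_test_split(preprocess(emails))
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, which matters when several algorithms are compared on the same train and test partitions.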

The second phase can be called the training stage. Here the classifier model, with the help of the selected ML algorithm, is trained on the 0.70 portion of the dataset to classify input data as spam or legitimate email. When the first and second phases are completed, the model begins to classify emails as spam or legitimate according to the given algorithm.

We applied the model shown in Fig. 1 to the three selected datasets. Because each dataset has different features, we had to use different functions in the model for each dataset. Since we used seven ML algorithms and compared them to obtain the highest accuracy, we ended up with seven models.

Seven supervised classification algorithms were selected to train and test the accuracy of phishing email detection with the grouped features. These algorithms were selected because they use different training strategies for discovering the rules and different mechanisms of learning and testing. The algorithms listed below are well known:

1. Locally-deep support vector machine.
2. Support vector machine.
3. Boosted decision tree.
4. Logistic regression.
5. Averaged perceptron.
6. Neural network.
7. Decision forest.
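As an illustration of one of the listed algorithms, the following is a textbook-style sketch of an averaged perceptron in plain Python. It is not the implementation used in the experiments; the function names, the label encoding in {-1, +1}, and the number of epochs are assumptions made for the example.

```python
def train_averaged_perceptron(X, y, epochs=10):
    """X: list of feature lists; y: labels in {-1, +1}. Returns averaged (weights, bias)."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    w_sum, b_sum, steps = [0.0] * n_features, 0.0, 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score <= 0:                      # misclassified: perceptron update
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
            # Accumulate the current weights after every example, updated or not.
            w_sum = [s + wj for s, wj in zip(w_sum, w)]
            b_sum += b
            steps += 1
    # The final model is the average of all intermediate weight vectors.
    return [s / steps for s in w_sum], b_sum / steps

def predict(model, xi):
    """Return +1 or -1 for a single feature vector."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else -1

X = [[2, 0], [3, 1], [0, 2], [1, 3]]   # toy, linearly separable data
y = [1, 1, -1, -1]
model = train_averaged_perceptron(X, y)
print([predict(model, xi) for xi in X])
```

Averaging the intermediate weight vectors, rather than keeping only the last one, tends to make the classifier less sensitive to the order in which the training examples are visited.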

The first dataset we used consists of 525,754 instances, of which 8351 are phishing emails and 517,402 are legitimate emails. This means that 98.4% of the data are legitimate and the dataset is imbalanced. To avoid over-fitting, we therefore had to preprocess the data and restore the balance between phishing and legitimate instances. The dataset consists of 22 features: Total Number of Characters (C), Vocabulary Richness (W/C), Account, Access, Bank, Credit, Click, Identity, Inconvenience, Information, Limited, Minutes, Password, Recently, Risk, Social, Security, Service, Suspended, Total Number of Function Words/W, Unique Words, and Phishing Status. We chose a random sample with 8351 phishing and 8400 legitimate instances. Then we split the data into 0.70 for training and 0.30 for testing, as detailed in Table 5.
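To illustrate what features of this kind look like, the sketch below computes a subset of them from raw email text. The first dataset already ships with its features, so this is only an assumption about how such values could be derived: the tokenization rule, the reading of W as the word count and C as the character count, and the function name are all hypothetical, and the function-word feature is omitted because it would need a word list that the text does not give.

```python
import re

# The 17 keyword features named in the dataset description.
KEYWORDS = ["account", "access", "bank", "credit", "click", "identity",
            "inconvenience", "information", "limited", "minutes", "password",
            "recently", "risk", "social", "security", "service", "suspended"]

def extract_features(text: str) -> dict:
    """Compute length, vocabulary, and keyword-count features for one email."""
    words = re.findall(r"[a-z']+", text.lower())    # assumed tokenization rule
    n_chars, n_words = len(text), len(words)
    features = {
        "total_characters": n_chars,                                    # C
        "vocabulary_richness": n_words / n_chars if n_chars else 0.0,   # W/C
        "unique_words": len(set(words)),
    }
    for kw in KEYWORDS:
        features[kw] = words.count(kw)              # occurrences of each keyword
    return features

sample = "Your account is suspended. Click here to verify your account password."
print(extract_features(sample))
```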

The third dataset consists of 2500 ham and 500 spam emails, in which all numbers and URLs were converted to the strings NUMBER and URL, respectively. This is the simplified spam and ham dataset. We split the data into 0.70 for training and 0.30 for testing.

Experiment 1: The first experiment showed that the lowest accuracy was obtained by Averaged Perceptron with 0.79586, and it is clear from Fig. 2 that Boosted Decision Tree gives the best accuracy with 0.888181.

Experiment 2: The second experiment employed the second dataset mentioned in Sect. 5.2.2. This dataset is completely different and has more features: it has 50 features, and its instances are 5000 phishing emails and 5000 legitimate emails. After balancing and processing the data, we applied the dataset to our model using the seven selected ML algorithms. The results of the second experiment increased clearly. The lowest accuracy was obtained by Neural Network with 0.995333, and it is clear from Fig. 3 that Boosted Decision Tree gives the best accuracy.

Experiment 3: In this experiment, we chose the dataset mentioned in Sect. 5.2.3, which has only the text feature and contains 2500 ham and 500 spam emails. We built a classifier model using Python and a TensorFlow/Keras neural network; TensorFlow is one of the most popular deep learning libraries, and we used it to classify the email text. The accuracy of the Python text classification model was 0.992. For more efficiency, and to compare more algorithms, we built a dedicated model using the Microsoft Azure ML tools, designed to train and test text classification on the dataset efficiently. We then applied the selected ML algorithms to this dataset. In the results of the third experiment, the lowest accuracy was obtained by Decision Forest with 0.953, and it is clear from Fig. 4 that Averaged Perceptron and Neural Network give the best accuracy with 0.977.
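The third dataset is described as having all numbers and URLs replaced by the strings NUMBER and URL. The dataset was used in that form, so the following is only a sketch of how such a normalization could be reproduced on new raw emails; the two regular expressions and the function name are assumptions, not the rules used to build the dataset.

```python
import re

# A URL is assumed to start with http://, https://, or www. and run to the next space.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
# A number is a run of digits, optionally grouped with "." or "," (e.g. 1,000 or 3.14).
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")

def normalize(text: str) -> str:
    """Replace URLs and numbers with the placeholder tokens URL and NUMBER."""
    text = URL_RE.sub("URL", text)        # URLs first, so digits inside them are not split out
    return NUMBER_RE.sub("NUMBER", text)

print(normalize("Win 1,000 dollars at http://win99.example.com/now today"))
# prints: Win NUMBER dollars at URL today
```

Replacing URLs before numbers matters: in the opposite order, the digits inside a URL would be rewritten first and the URL pattern would then see a mangled address.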

We now compare the results of the three experiments: the first experiment used a dataset with 22 features, the second used a dataset with 50 features, and the third used a dataset with the text feature only. The results are summarized in Fig. 5.

Figure 5 shows that the best accuracy rates were achieved by Boosted Decision Tree and Neural Network, and the lowest by Decision Forest.

Returning to the related work mentioned in Sect. 2, we start by comparing our results with Fette et al. [21], who used two non-spam, non-phishing datasets with 10 features in their proposed approach (PILFER). The overall accuracy they achieved was 99.5. In contrast, we used a dataset with 22 features in our first experiment and another dataset with 50 features in the second experiment, and, as we saw, the accuracy increased as we used more features. The difference between the two works is that we reviewed each dataset and balanced its phishing and non-phishing instances, which makes our results more reliable, whereas they used an imbalanced dataset with 6950 non-phishing emails and 860 phishing emails, which means that about 0.89 of their dataset consists of non-phishing emails. We, on the other hand, preprocessed the datasets we used: while the first dataset (the phishing email collection) consisted of 517,402 legitimate emails and 8351 phishing emails, the balanced dataset that we obtained after preprocessing contains 8351 phishing and 8400 legitimate instances.
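The class proportions behind this comparison can be checked with a short calculation. The helper name below is hypothetical; the counts are the ones quoted in the text.

```python
def class_share(majority: int, minority: int) -> float:
    """Fraction of the dataset that belongs to the majority class."""
    return majority / (majority + minority)

pilfer_share = class_share(6950, 860)       # dataset of Fette et al. [21]
raw_share = class_share(517402, 8351)       # our first dataset before balancing
balanced_share = class_share(8400, 8351)    # our first dataset after balancing

print(round(pilfer_share, 2), round(raw_share, 3), round(balanced_share, 3))
# prints: 0.89 0.984 0.501
```

A share close to 0.5 means that a classifier cannot reach a high accuracy simply by always predicting the majority class, which is the motivation for balancing before training.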

With regard to reference [22], the authors used a text dataset consisting of 1897 spam, 3900 ham, and 250 'hard' ham emails, while the text dataset in our third experiment contains 2500 ham emails and 500 spam emails. In their work, they identified Random Forests as the classifier for separating spam from ham, and the average accuracy they obtained is 98.3 using WEKA, the open-source software. In our experiments we used the seven algorithms mentioned in section 6, and the highest accuracy we obtained is 97.7, using AZURE with the Neural Network algorithm.

Islam et al. [8] proposed a multi-stage classification technique using three popular learning algorithms: NB, SVM, and AdaBoost. They used the public PUA datasets and reported an accuracy of 97.05 for their proposed system. In comparison, we used three datasets with different features and seven ML algorithms.

Phishing emails have become a common problem in recent years. Phishing email attacks are intelligently crafted social engineering attacks in which victims are conned by email into providing important information, which is then sent directly to the phisher. Young users are more likely to fall for phishing attacks. Furthermore, users with an agreeable personality trait are more likely than other users to be lured by phishing scams. Women are more likely to provide their personal and financial details to phishing emails and websites. This relationship between gender and social engineering is influenced by internet usage behavior. The detection of this type of email is therefore necessary.

There are numerous techniques for detecting phishing emails. However, they have limitations, such as low accuracy. Because the content of a phishing email may be the same as that of a legitimate email, it may go undetected, and the detection rate is not high.

This work employed machine learning techniques to achieve better results and to capture inherent characteristics of the email text and other features in order to classify emails as phishing or non-phishing. This research achieved a better accuracy of phishing email detection. The models were evaluated on three supervised datasets, and a comparison between the classifiers was conducted.

Finally, the results obtained with the different algorithms were compared. The results show that an algorithm based on many features (50) gave the highest accuracy, whereas with fewer features (22) the accuracy was fairly high but not effective enough to detect phishing emails. The limitation of this work was finding predefined datasets.

As future work, we note that feature selection techniques need more improvement to cope with the continuous development of new techniques by phishers over time. Therefore, we recommend developing a new automated tool to extract new features from new raw emails, in order to improve the accuracy of phishing email detection and to cope with the expanding range of phisher techniques.

Author Contributions All Three Authors worked in an equivalent load at all stages to produce this research.

Funding This work was supported by the Hashemite University and AL Zaytoonah University of Jordan.

Data availability The data set used in the work will be available upon request.

Conflict of interest The authors have not disclosed any competing interests.

Informed consent I have read and I understand the journal information and have agreed to all mentioned terms and conditions.

Cyber-crime effect on Jordanian society

Deep CNN model for nanotoxicity classification using microscopic images

An intelligent system for blood donation process optimization-smart techniques for minimizing blood wastages

Extreme learning machine for plant diseases classification: a sustainable approach for smart agriculture

Fraud detection in the distributed graph database

Fast attack detection system using log analysis and attack tree generation

A novel mechanism to handle address spoofing attacks in sdn based iot

New direction of cryptography: a review on text-to-image encryption algorithms based on rgb color value

A secure encrypted protocol for clients' handshaking in the same network

Social engineering attacks: a survey

Phishing detection: a literature survey

Large-scale automatic classification of phishing pages

The state of phishing attacks

Evaluation online learning of undergraduate students under lockdown amidst covid-19 pandemic: the online learning experience and students' satisfaction

An email classification scheme based on decision-theoretic rough set theory and analysis of email security

Deep learning framework for cyber threat situational awareness based on email and url data analysis

Transferable hmm trained matrices for accelerating statistical segmentation time

Efficient 3d medical image segmentation algorithm over a secured multimedia network

Data fusion in autonomous vehicles research, literature tracing from imaginary idea to smart surrounding community

Load balancing techniques in software-defined cloud computing: an overview

Learning to detect phishing emails

Classification of email using beaks: behavior and keyword stemming

Phishing email detection technique by using hybrid features

A platform for power management based on indoor localization in smart buildings using long short-term neural networks

Parallel implementation of fcm-based volume segmentation of 3d images

Effective email classification for spam and non-spam

Detection of phishing attacks: a machine learning approach

Employing machine learning techniques for detection and classification of phishing emails

Detection of phishing emails using data mining algorithms

Who falls for phish? a demographic analysis of phishing susceptibility and effectiveness of interventions

Social phishing

Teaching johnny not to fall for phish

School of phish: a real-world evaluation of anti-phishing training

Getting users to pay attention to antiphishing education: evaluation of retention and transfer

A Personality Based Model for Determining Susceptibility to Phishing Attacks

Exposure and use of mobile media devices by young children

How and why parents guide the media use of young children

Investigating teenagers ability to detect phishing messages
