key: cord-0907367-3agjf0vg authors: Sonowal, Gunikhan title: Detecting Phishing SMS Based on Multiple Correlation Algorithms date: 2020-11-02 journal: SN Comput Sci DOI: 10.1007/s42979-020-00377-8 sha: 945520fb8aab0641d5b67026e39709f29dd0e3f6 doc_id: 907367 cord_uid: 3agjf0vg

SMS phishing is a form of phishing in which the phisher uses SMS as the medium to reach victims; it is known as smishing (SMS + phishing). Researchers have proposed several anti-phishing methods in which a correlation algorithm is applied to assess the relevance of the features, since the feature corpus usually contains many features. The correlation algorithm ranks the features, and a higher rank indicates a feature that is more relevant to the task. This paper therefore analyses four correlation algorithms, namely Pearson rank correlation, Spearman's rank correlation, Kendall rank correlation, and Point biserial rank correlation, together with a machine learning algorithm to determine the best feature set for detecting smishing messages. The investigation reveals that the AdaBoost classifier offered the best accuracy among the classifiers tested. Further analysis shows that this classifier achieved higher accuracy with the Kendall rank correlation than with the other correlation algorithms. The experiment confirms that the ranking algorithm reduced the feature dimension by 61.53% while giving an accuracy of 98.40%.

Phishing is a critical attack these days, in which attackers steal users' credentials by combining social engineering with technology [44, 46]. Social engineering is the practice of using influence and persuasion to deceive victims into giving up information or performing some operation [16, 38]. The United Nations reported a 350% rise in phishing websites during the COVID-19 pandemic [48].
Phishing is currently expanding rapidly; the number of unique phishing websites detected in January-June 2020, as reported by the Anti-Phishing Working Group (APWG) [4], is shown in Fig. 1. Attackers nowadays employ numerous communication media to reach victims, such as email, text message (SMS), telephone, and others [5]. SMS is one of the most practical of these, because it allows people to communicate through mobile phones without the internet. Mobile phones are nearly ubiquitous: the number of mobile phone users was estimated at 5.15 billion in 2020 [14]. According to CallHub, SMS messages have an open rate of 98% and a response rate of 45%, whereas email has an open rate of 28-33% and a response rate of only 6% [7, 10]. Attackers exploit this service by sending phishing SMS that resemble legitimate SMS in order to steal credentials [23, 37, 52]. According to [33], smishing is a variant of phishing in which attackers send instant messages instead of emails; the messages appear to have been sent by a genuine organization and ask the recipients to tap on a link or disclose their credentials in a text message reply. SMS messages are inexpensive, and even with a modest SMS package phishers are able to send a substantial number of messages to users [15]. According to a report by the security firm Cloudmark, approximately 30 million smishing messages are sent to mobile users across North America and Europe [22].

Smishing is an active research area in which several researchers are proposing detection methods. Most existing methods, however, only provide guidelines, such as raising awareness of unknown links in text messages, and very few publications in the literature address the detection of smishing messages.
Accordingly, this paper proposes a model that detects smishing messages with a machine learning algorithm. The motivation of this paper is to determine the best feature set using more than one correlation algorithm. Sonowal and Kuppusamy [47] proposed a model, SmiDCA, which used the Pearson correlation algorithm with a machine learning algorithm to detect phishing SMS. The weakness of the SmiDCA model is its relatively low accuracy. This paper therefore applies several correlation algorithms, namely Pearson rank correlation, Kendall rank correlation, Spearman rank correlation, and the Point-Biserial correlation, and identifies the correlation algorithm that gives the best accuracy.

This paper is organized as follows. Section "Literature Review" reviews the background work on smishing. Section "Methodology" explains the methodology of the paper. Section "Correlation Algorithms" discusses the correlation algorithms. Section "Experimental Analysis" evaluates the proposed methodology and presents the results. Section "Discussion" discusses the results, and the paper is concluded in section "Conclusion".

As mentioned above, attackers use several communication media to reach victims, and SMS is an important medium through which phishers target mobile phone users. For this reason, this paper concentrates on methods for detecting phishing SMS, an area in which researchers have recently presented several advanced techniques. The remainder of this section reviews some of these anti-phishing SMS techniques.

The authors of [19] proposed a mobile-phone-specific anti-phishing solution that identifies the essential measures for improving tactics to combat phishing attacks on mobile devices. They identified several variants of phishing attacks on mobile devices, such as Bluetooth phishing, smishing, vishing, and mobile web application phishing.
Models of this category normally employ a heuristic-based approach in which distinct features are extracted to detect phishing attacks. The number of characters in a text message is limited by the communication protocol. Most attackers therefore send a short URL to victims, and it is difficult to verify which file or webpage the short URL leads to. To address this, the authors of [35] proposed a short URL generation method that recognizes the risk of the target URL: it analyzes the destination webpage, measures its risk, and blocks the short URL by comparing the risk with a predefined threshold. The shortcoming of this model is the determination of the threshold.

Another method, S-Detector (Smishing detector), was proposed in [27] to differentiate genuine messages from smishing messages. The method first checks whether the message contains a URL. If a URL is detected, the model checks whether it is a short URL. A short URL is converted into the long URL and checked for an APK file. If such a file is present, the investigation terminates; otherwise, a morphological analyzer is applied and the nouns are selected as features. The model uses a Naive Bayesian classifier to block smishing messages and notifies the users about them. The authors report that the model provides protection, accessibility, and reliability in preventing shrewder and more malignant security threats. The weakness of this model is that it uses only keywords, which makes it unsuitable for attacks based on phone numbers, email IDs, and other vectors.

A further investigation in [6] examined the time of day at which spam messages are sent and found that the highest peaks of spam messages occur from 10 am to 4 pm.
Further analysis showed that frequent words in the spam messages include candidate, congressmen, election, candidate number, and information. The authors analyzed the contents of the spam sent by each spammer and, based on the keywords present in the URL, were able to separate smishing messages from legitimate messages. The model used only limited features, which is insufficient for detecting all categories of smishing messages.

In another study on recognizing smishing messages [36], the authors assembled seven features and analyzed them with a random forest classifier, which achieved an accuracy of 92%. Lee et al. [30] used a cloud-based virtual environment to identify suspicious URLs by verifying whether the URL is associated with the download of an APK file or an application without reference. The authors report that this approach increased the probability of identifying smishing.

Mishra and Soni [34] proposed a more recent model that contains multiple filters. The SMS Content Analyzer inspects the content of the text message. A Naive Bayes classification algorithm classifies the malicious content and keywords present in the message. The URL Filter assesses the URL to recognize malicious features. The Source Code Analyzer examines the source code of the website to find harmful code embedded in it. The APK Download Detector determines whether any malicious file is downloaded when the URL is invoked. The analysis shows an accuracy of 96.29%. Although the model employs multiple stages for detecting smishing messages, it requires other classification algorithms to increase accuracy.

A recent model, SmiDCA [47], collected 39 features to distinguish smishing messages.
This model used a random forest classifier with a feature selection algorithm and achieved 96.16% accuracy; with the help of the feature selection algorithm, it reduced the feature dimension by 40.71%. The weakness of this model is its low accuracy.

The smishing detection models described above focus primarily on the URL, since attackers use short URLs to hide malware files. Their shortcomings are the reliance on the URL alone and on a limited set of keywords, and their accuracy is comparatively low. This paper therefore collects 52 features for detecting phishing SMS and uses four different ranking algorithms to rank them. With the AdaBoost classifier, the model achieved an accuracy of 98.40% using the Kendall rank correlation, even though 61.53% of the features were pruned from the feature corpus.

The proposed approach consists of three components: feature collection, feature ranking, and a search for the best feature set using a machine learning algorithm. The feature collection component gathers existing and novel features from the SMS and builds a feature vector. The feature vectors are passed to the ranking algorithm, which assesses the relevance of the features. For ranking, the proposed model employs four correlation algorithms, where a higher rank indicates a feature that is more relevant to the task. After arranging the features by rank, the proposed model employs a sequential forward feature selection algorithm to search for the best feature set.

[43]. The text style of phishing messages differs from that of legitimate messages, which yields discriminating features for detecting phishing messages. Six readability algorithms are employed in the feature corpus, as explained below:

- Automated readability index (F34): Smith and Senter [42] proposed the automated readability index for measuring the readability score.
The equation of the automated readability index is shown in Eq. (1), where $L$ is the number of letters and numbers, $W$ is the number of spaces, and $S$ is the number of sentences.

$$\mathrm{ARI} = 4.71\,\frac{L}{W} + 0.5\,\frac{W}{S} - 21.43 \qquad (1)$$

In the remaining readability formulas, beginning with Eq. (2), TW is the total number of words, TS is the total number of sentences, Tsy is the total number of syllables, and TP is the total number of polysyllables.

The term "correlation" refers to a measure of the relationship between quantities. Several correlation algorithms are employed to rank the features, so that their relevance can be evaluated and the dimension of the feature corpus reduced. This paper examines four kinds of correlation: Pearson rank correlation, Kendall rank correlation, Spearman rank correlation, and the Point-Biserial correlation [11, 12, 28]. Assume that $X = \{x_1, x_2, \ldots, x_n\}$ is the feature vector and $Y = \{y_1, y_2, \ldots, y_n\}$ is the decision vector. The correlation algorithm computes the rank of a feature by comparing the feature vector with the decision vector.

The Pearson correlation coefficient (PCC) is widely used in feature selection to determine the best feature set [47]. The PCC is defined in Eq. (7),

$$\rho_{P,Q} = \frac{\operatorname{cov}(P, Q)}{\sigma_P\,\sigma_Q} \qquad (7)$$

where $\operatorname{cov}(P, Q)$ denotes the covariance of $P$ and $Q$, and $\sigma_P$ and $\sigma_Q$ are the standard deviations of $P$ and $Q$. For a sample, the coefficient can be written as

$$r = \frac{\sum_{i=1}^{n} (p_i - \bar{p})(q_i - \bar{q})}{\sqrt{\sum_{i=1}^{n} (p_i - \bar{p})^2}\,\sqrt{\sum_{i=1}^{n} (q_i - \bar{q})^2}} \qquad (8)$$

where $\bar{p}$ and $\bar{q}$ are the means of $P$ and $Q$.

Several researchers have used the Spearman rank correlation in feature selection algorithms [41, 50]. The equation of Spearman's coefficient is similar to that of Pearson's, and the simplified version is shown in Eq. (9),

$$\rho_S = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \qquad (9)$$

where $\rho_S$ is the Spearman rank correlation, $d_i$ is the difference between the two ranks of the $i$-th observation, and $n$ is the number of observations.

Unlike Spearman's coefficient, Kendall's coefficient does not consider the size of the difference between ranks, only directional agreement [20, 29]. The equation of the Kendall rank correlation is shown in Eq. (10),

$$\tau = \frac{n_c - n_d}{0.5\,n(n - 1)} \qquad (10)$$

where $n_c$ is the number of concordant pairs and $n_d$ is the number of discordant pairs.

The point-biserial correlation is related to the Pearson correlation, except that one of the variables is dichotomous [9, 31]. The equation of the point-biserial correlation is given in Eq. (11), where $S_y$ is the standard deviation and $\bar{Y}$ and $\bar{X}$ are mean values.

Machine learning algorithms are widely studied for SMS classification, and numerous classification algorithms have been applied to recognize phishing SMS. This paper evaluates four well-known classifiers: AdaBoost, random forest, decision tree, and support vector machine.

Decision Tree
The decision tree accepts both categorical and continuous input and separates the data into two or more homogeneous sets based on the most important splitter among the input features [49]. Each node of the tree represents a feature (attribute), each link (branch) represents a decision, and each leaf represents a result (a discrete or continuous value). The main difficulty of the decision tree is choosing the feature for the root node at each level, which is known as feature selection. Two major criteria are used to determine the root: Information Gain (IG) and Gini Index (GI).

Random forest Algorithm
The random forest algorithm is an ensemble classification algorithm, that is, a collection of classifiers [1, 2, 8, 24, 45]. Rather than using a single classifier to predict the target, an ensemble uses several classifiers. In the random forest, the ensemble members are randomly generated decision trees; each decision tree is a single classifier, and the target prediction is made by majority voting. The class that receives the majority of the votes is taken as the final predicted class.

The support vector machine is based on the idea of locating a hyperplane that maximizes the margin between the two classes [17, 25, 51]. The vectors that define the hyperplane are the support vectors. A hyperplane is a decision plane that separates the different classes, and the margin is the gap between two lines, computed as the perpendicular distance from the line to the support vectors. A larger margin between the classes is considered a good margin, and a smaller margin a bad one.

AdaBoost
Boosting is a general ensemble technique that builds a reliable classifier from several weak classifiers. AdaBoost is a boosting algorithm developed for binary classification and is best used to boost the performance of decision trees [26, 39, 40]. A weak classifier is trained on the training data using weighted samples; each decision stump makes its decision on one input variable and outputs +1.0 or -1.0 for the first or second class. The misclassification rate is computed from the trained model.

Feature selection algorithms serve multiple objectives, such as improved accuracy, reduced complexity, and faster training of machine learning algorithms. This paper, however, primarily aims to improve accuracy. Once the features are ranked by the ranking algorithm, they are passed to the search algorithm, which searches for the best feature set. This paper employs the sequential forward feature selection algorithm for this purpose. The sequential forward feature selection adds features to the feature set one by one, in order from the highest rank. The limitation of this algorithm is the termination point: without a stopping rule, the algorithm would continue until all features have been added. This paper therefore adopts the policy defined by Sonowal and Kuppusamy [47]: if the classifier gives constant or lower accuracy three times in a row, the algorithm terminates.

This paper uses the dataset of phishing and ham SMS collected by Tiago A. Almeida [3], which is summarized in Table 1. This data was used with several machine learning methods in order to verify the performance of the proposed method. The experiment is carried out in three steps: the first step determines the best classifier, the second step finds the relevant features using a correlation algorithm, and the third step evaluates the best feature set using the best classifier with the relevant features.

In the first step, once the features are obtained, the model selects all the features and is evaluated with the four well-known classifiers (random forest, decision tree, AdaBoost, and support vector machine) to find the best classifier, as explained in section "Machine Learning Algorithm". Table 2 shows that AdaBoost achieved slightly better accuracy than the other classifiers. AdaBoost is therefore selected for the further experiments.

In the second step, the model applies the correlation algorithms to rank the features, where a higher rank implies greater relevance to the task. As stated above, this paper adopts four types of correlation algorithm, as explained in section "Correlation Algorithms". The different correlation algorithms identify different features as relevant.

In the third step, the model employs the sequential forward feature selection algorithm to determine the best feature set. The algorithm takes the features one by one in rank order and evaluates the accuracy with the best classifier, as explained in section "Feature Search Algorithm". The model evaluated the accuracy separately for each correlation algorithm. Table 3 shows the accuracy obtained with each correlation algorithm based on its ranking of the features. At first glance, all the algorithms produced comparable accuracy.
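The second and third steps can be illustrated with a minimal Python sketch. It is not the author's implementation: it ranks features by the absolute Kendall coefficient of Eq. (10), computed as $(n_c - n_d) / (0.5\,n(n-1))$, and then adds features in rank order until the accuracy fails to improve three times in a row. The function names, the use of the absolute value, the choice to compare each step against the best accuracy seen so far, and the `evaluate` callback (which would wrap the chosen classifier, for example a cross-validated AdaBoost) are all assumptions made for illustration, and the accuracy values in the usage example are made up.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall rank correlation: (n_c - n_d) / (0.5 * n * (n - 1))."""
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # Tied pairs (sign == 0) count as neither concordant nor discordant.
    return (concordant - discordant) / (0.5 * n * (n - 1))


def rank_features(columns, labels):
    """Sort feature names by |tau| against the labels, most relevant first.

    Using the absolute value is an assumption of this sketch.
    """
    scores = {name: abs(kendall_tau(col, labels)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


def forward_select(ranked, evaluate, patience=3):
    """Add features in rank order and keep the best-scoring prefix.

    Stops once `patience` consecutive additions give an accuracy that is
    equal to or lower than the best accuracy seen so far.
    """
    selected, best_set, best_acc, stalled = [], [], float("-inf"), 0
    for name in ranked:
        selected.append(name)
        acc = evaluate(selected)  # hypothetical callback, e.g. classifier accuracy
        if acc > best_acc:
            best_acc, best_set, stalled = acc, list(selected), 0
        else:
            stalled += 1
            if stalled >= patience:
                break
    return best_set, best_acc


if __name__ == "__main__":
    # Toy data: three hypothetical features for six messages (1 = smishing).
    labels = [1, 1, 1, 0, 0, 0]
    columns = {
        "msg_length": [120, 80, 150, 90, 100, 60],
        "has_url": [1, 1, 1, 0, 0, 1],
        "digit_count": [3, 5, 4, 4, 2, 6],
    }
    ranked = rank_features(columns, labels)
    # Made-up accuracies keyed by the number of selected features.
    toy_accuracy = {1: 0.90, 2: 0.95, 3: 0.95}
    print(ranked, forward_select(ranked, lambda s: toy_accuracy[len(s)]))
```

Keeping the best-scoring prefix, rather than the last one evaluated, means the three unproductive additions that trigger the stop are not included in the returned feature set.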
A closer inspection, however, reveals that the Kendall rank correlation offered slightly better accuracy (98.40%). The number of features is also an important aspect of the methodology. The table shows that Spearman's rank correlation used only 18 out of 52 features, which means that this correlation reduced the feature corpus by 65.38%, although its accuracy is slightly lower than that of the Kendall rank correlation. For the Kendall rank correlation, 61.53% of the features were pruned.

The proposed method is compared with other methods in Table 4. The result shows that the proposed method provided better accuracy than the other methods. The comparison also shows that the proposed model required the same number of features as SmiDCA but provided better accuracy. From this result, it can be concluded that the proposed method has the potential to detect phishing SMS adequately.

The aim of this paper is to detect smishing messages using a correlation algorithm with a machine learning classifier. Initially, this paper collected 52 features from different aspects of the SMS and evaluated them with four well-known classifiers. The result shows that the AdaBoost classifier provided the best accuracy, 98.67%. Although this accuracy was satisfactory for detecting smishing messages, the feature dimension, at 52 features, was too high. For this reason, this paper adopted a feature selection algorithm to reduce the dimension of the features. Four ranking algorithms were used to rank the features, and the sequential forward feature selection algorithm was used to search for the best feature set. The experiment shows that the Kendall ranking algorithm offered the best accuracy (98.40%) with the AdaBoost classifier. Furthermore, this algorithm reduced the number of features by 61.53%, which indicates that the proposed model was able to prune more than half of the features.
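The reduction percentages quoted here can be checked with a few lines of arithmetic. The sketch below is only an illustration; the helper names are hypothetical, and the figure of 20 retained features for the Kendall ranking is inferred from the reported 61.53%, not stated in the text.

```python
def reduction_percent(total, kept):
    """Percentage of features pruned when `kept` of `total` features remain."""
    return 100.0 * (total - kept) / total


def kept_from_percent(total, pruned_percent):
    """Number of features remaining after pruning `pruned_percent` of `total`."""
    return round(total * (1 - pruned_percent / 100.0))


# Spearman: 18 of 52 features kept, so 34 were pruned.
print(round(reduction_percent(52, 18), 2))
# Kendall: a 61.53% reduction of 52 features implies this many kept features.
print(kept_from_percent(52, 61.53))
```

With 20 features kept, the exact reduction is 32/52, or about 61.538%, so the reported 61.53% appears to be truncated instead of rounded.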
Finally, the results were compared with other anti-smishing methods, and the comparative analysis demonstrates that the proposed model achieved better accuracy. From this examination, it can be inferred that the proposed model is able to detect smishing messages.

Smishing messages are growing rapidly and have become a dominant form of cyber-attack. Although many researchers are proposing advanced techniques to slow these attacks, complete detection has not yet been achieved. A large number of features is often used to increase detection accuracy. However, a larger number of features does not necessarily yield better accuracy, and feature selection algorithms are used in this scenario to reduce the feature dimension. This paper employed four ranking algorithms to rank the features: Pearson rank correlation, Spearman's rank correlation, Kendall rank correlation, and Point biserial rank correlation. This paper first compared the machine learning classifiers and found that the AdaBoost classifier offered the best accuracy. Furthermore, the Kendall rank correlation offered better accuracy than the other correlation algorithms. The experiment shows that the ranking algorithm was able to reduce the feature dimensionality by 61.53% while providing an accuracy of 98.40%. In the future, more features and a more advanced feature selection algorithm will be applied. The goal of the future model is to find the best feature set in less time.

Conflicts of interest The authors declare that they have no conflict of interest.

Ethical approval This article does not contain any studies with human participants performed by any of the authors.

1. Abu-Nimeh S, Nappa D, Wang X, Nair S (2007) A comparison of machine learning techniques for phishing detection.
In: of the Anti-phishing Working Groups 2Nd Annual eCrime Researchers Summit Classification of phishing email using random forest machine learning technique Anti-Phishing Working Group (APWG) Proposing a new clustering method to detect phishing websites Property analysis of sms spam using text mining Text marketing vs. email marketing: Which one packs a bigger punch Classifying phishing emails using confidence-weighted linear classifiers Applied statistics: correlation coefficients 6 reasons why sms is more effective than email marketing-callhub Correlation: parametric and nonparametric measures Testing dependent correlation coefficients via structural equation modeling A computer readability formula designed for machine scoring ~:text=The%20num ber%20 of%20mob ile%20pho ne,lates t%20dat a%20fro m%20GSM A%20 Int ellig ence Sms spam filtering: methods and data Ethical hacking and countermeasures: web applications and data servers Learning to detect phishing emails A new readability yardstick Phishing detection taxonomy for mobile device Feature selection for ranking The technique of clear writing Cellular network fraud & security, jamming attack and defenses Random decision forests A svm-based technique to detect phishing urls A multi-tier phishing detection and filtering approach S-detector: an enhanced security model for detecting smishing attack for mobile computing Rank correlation methods, trans. 
London: Edward Arnold Rank correlation methods A study on realtime detecting smishing on cloud computing environments The point biserial coefficient of correlation Smog grading-a new readability formula Protect yourself from smishing Smishing detector: a security model to detect smishing through sms content analysis and url behavior analysis Secure short url generation method that recognizes risk of target url Disributed system for smishing detection Survey about attack and defence phishing techniques Social engineering: concepts and solutions Phishgillnet-phishing detection methodology using probabilistic latent semantic analysis, adaboost, and co-training Phishing detection and impersonated entity discovery using conditional random field and latent dirichlet allocation Joint European conference on machine learning and knowledge discovery in databases Automated readability index Phishing email detection based on binary search feature selection Journal of King Saud University-Computer and Information Sciences Mmsphid: a phoneme based phishing verification model for persons with visual impairments Masphid: a model to assist screen reader users for detecting phishing sites using aural and visual SN Computer Science similarity measures Smidca: an anti-smishing model with machine learning approach Increasing cybercrime: un reports 350 per cent rise in phishing websites during pandemic Phishing detection using classifier ensembles A simple filter benchmark for feature selection Profiling phishing emails based on hyperlink information Phishing, smishing & vishing: an assessment of threats against mobile devices