key: cord-0963559-ipfru0e4
authors: Mvula, Paul K.; Branco, Paula; Jourdan, Guy-Vincent; Viktor, Herna L.
title: COVID-19 malicious domain names classification
date: 2022-05-20
journal: Expert Syst Appl
DOI: 10.1016/j.eswa.2022.117553
sha: 88f6846918f685c7c888c12a20558e02f91bbd8b
doc_id: 963559
cord_uid: ipfru0e4

Due to the rapid technological advances made over the years, more people are moving from traditional ways of doing business to ones that rely heavily on electronic resources. This transition has attracted (and continues to attract) the attention of cybercriminals, referred to in this article as "attackers", who exploit the structure of the Internet to commit cybercrimes, such as phishing, in order to trick users into revealing sensitive data, including personal information, banking and credit card details, IDs, passwords, and other important information, via replicas of legitimate websites of trusted organizations. In our digital society, the COVID-19 pandemic represents an unprecedented situation. As a result, many individuals were left vulnerable to cyberattacks while attempting to gather credible information about this alarming situation. Unfortunately, attackers took advantage of this situation, and attacks associated with the pandemic increased dramatically. Regrettably, cyberattacks do not appear to be abating. For this reason, cyber-security corporations and researchers must constantly develop effective and innovative solutions to tackle this growing issue. Although several anti-phishing approaches are already in use, such as blacklists, visual-similarity checks, heuristics, and other protective solutions, they cannot efficiently prevent imminent phishing attacks. In this paper, we propose machine learning models that use a limited number of features to classify COVID-19-related domain names as either malicious or legitimate. Our primary results show that a small set of lexical features, carefully extracted from domain names, allows models to yield high scores; additionally, the number of subdomain levels, used as a feature, can have a large influence on the predictions.

The novel coronavirus disease, or COVID-19, is a highly contagious respiratory and vascular disease caused by infection with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). On December 8, 2019, WHO officials reported the earliest onset of symptoms, and they declared the outbreak a pandemic on March 11, 2020. The disease went on to infect over 460 million people and caused over 6 million deaths in 225 countries and territories, as of March 15, 2022. As a result of the COVID-19 pandemic, public health officials and governments have been working together to lower the transmission and death rates and control the pandemic. For example, they put various measures in place, including social distancing, mandatory wearing of face masks in public and crowded places such as supermarkets, and frequent hand washing. Most importantly, officials in many countries ordered people to stay at home and avoid unnecessary trips. Moreover, some governments imposed stricter measures, such as border closures, lockdowns, and curfews. Despite these measures, the transmission rate and death toll have not abated. The measures put in place, including some countries' authorities banning gatherings of more than five people, forced companies worldwide to switch from working on site to working remotely from home.
In addition, workers who had traveled overseas for holidays were stuck in the countries they were visiting. Schools were also forced to change over to distance learning.

The IBM X-Force Incident Response and Intelligence Services (IRIS) 1, together with Quad9 2, a public recursive DNS resolver offering end users security, high performance, and privacy, have been tracking cybercrime that has capitalized on the coronavirus pandemic since its beginnings. These services have uncovered a variety of COVID-19 phishing attacks against individuals and organizations with a potential interest in the technologies associated with the safe delivery of the vaccine against the coronavirus. IRIS and Quad9 observed that over 1000 coronavirus-related malicious domain names were created between February and March 2020. Correspondingly, in a trend report released in 2020, the Anti-Phishing Working Group (APWG) 3, an international consortium dedicated to promoting research, education, and law enforcement to eliminate online fraud and cybercrime, revealed an increase in the number of unique malicious websites. Since that remarkable spike in March, the same month the outbreak was declared a pandemic, the number of COVID-19-related threat reports has been increasing on a week-to-week basis, according to data collected from IBM X-Force. Meanwhile, Quad9's attempts to detect and block malicious COVID-19-related domain names have led to reports of a substantial increase in the number of domain names blocked as being associated with malicious COVID-19 activity. According to IBM X-Force IRIS research, attackers are using email, SMS text messages, and social media to deliver malicious websites that spoof government portals and require users to input financial and personal data. After obtaining this information from victims, attackers may open bank accounts and even apply for loans in their names.

These findings highlight that anti-phishing developments are being made by industry and academia to prevent attacks, paying special attention to those linked to COVID-19 (Basit, 2021). This emphasizes the significance of developing effective solutions for COVID-19-related attacks, which are on the rise. According to Le Page (2019), anti-phishing development efforts can be grouped into 3 main areas, as depicted in Figure 1:

• Detection techniques.
• Understanding the phishing ecosystem (phishing prevention).
• User Education (Varshney et al., 2016).

Although user education and understanding the phishing ecosystem are crucial for phishing prevention, end users, as the weakest link in the security chain (Schneier, 2000), remain vulnerable to attacks because attackers can use new techniques and exploit those vulnerabilities to trick them into clicking on links and inputting their personal information. This phenomenon was especially evident during the onset of the COVID-19 pandemic, when the extraordinary situation and subsequent lockdowns left end users more prone to respond atypically to phishing attacks. Specifically, cybercriminals leveraged people's heightened online activity spurred by COVID-19 to launch coronavirus-themed phishing attacks. In this paper, we focus on detection techniques in the context of the pandemic. This paper introduces an ML model that classifies COVID-19-related domain names as malicious or legitimate by using different ML algorithms and comparing them to yield the best result.
First, we collected legitimate and confirmed malicious domain names containing keywords related to COVID-19 from publicly available sources to construct our own data set. Next, we extracted useful features from the generated data set and then moved on to perform model construction and algorithm tuning. In summary, our contributions are as follows:

1. We addressed the problem of malicious domain name detection, focusing on those related to COVID-19.
2. We proposed a small feature set that may be extracted from the data in a timely manner.
3. We developed both online and batch learning models for malicious domain name detection.
4. We conducted several experiments with varied data distributions and feature sets to evaluate the performance of both the batch and online learning methods.

The remainder of the paper is organized as follows: the next section examines related works and literature concerning malicious URL and domain name detection. Then, Section 3 discusses the difficulties involved in the detection of malicious domain names. The steps taken in the acquisition of the data set and the feature extraction process are explained in Section 4, while Section 5 describes the experimental setup and proposed model. Next, Section 6 presents and compares the results of the different algorithms, followed by Section 7, which offers conclusions and suggests future works on the topic. Finally, Section 8 contains acknowledgments, followed by references.

Software-based detection techniques are generally divided into three classes: visual similarity-based (VSB) detection systems (Jain & Gupta, 2017), list-based (LB) detection systems (Han et al., 2012; Cao et al., 2008), and machine learning based (MLB) detection systems. VSB approaches can be grouped into HTML DOM (hypertext markup language Document Object Model), Cascading Style Sheet (CSS) similarity, visual features, visual perception, and hybrid approaches (ALmomani, 2013). Rosiello et al. (2007) presented an approach based on the reuse of the same data across different sites. In the proposed approach, a warning would be generated if a client reused the same data (i.e., the same username, password, and so forth) on numerous websites. The system analyzed the DOM tree of the first web page where the data was initially entered and the subsequent web page where the information was reused. If the DOM tree between these two web pages was found to be similar, the system considered it an attack; otherwise, it was treated as a legitimate reuse of data. Although this system presented a high true positive rate of almost 100%, it failed when the malicious sites contained only images. Abdelnabi et al. (2020) introduced VisualPhishNet, a triplet-network framework that learned a similarity metric between any two pages of the same website for visual similarity phishing detection. The training of VisualPhishNet was divided into two stages: training on all screenshots with random sampling and fine-tuning by retraining on those samples that were difficult to learn in the first stage. The authors reported a Receiver Operating Characteristic (ROC) Area Under the Curve (AUC) of 98.79% on their data set, VisualPhish, which contained 155 websites with 9363 screenshots. Although this approach yielded a high ROC AUC and could detect zero-day attacks, it failed when the malicious pages had popup windows covering the site's logo. Dunlop et al.
(2010) developed goldphish, a browser plug-in designed to recognize malicious sites. It extracted the suspicious site's logo and converted it to text using optical character recognition (OCR) software; the extracted text was then used as a query for the Google search engine. The suspicious site's domain was compared with the top search results, and if a match was detected, the site was declared legitimate; otherwise, it was considered malicious. Although goldphish could detect new attacks (known as zero-hour attacks) and identify well-known companies' logos, it was unable to convert text from logos with dark backgrounds. Chiew et al. (2015) presented an extension of goldphish that used the logo for phishing detection. This approach was divided into two stages: logo extraction and site identity confirmation. The hybrid method used ML to extract the right site's logo from all pictures present. Next, the right logo was queried in "Google Images" to obtain the corresponding domain names in the results. Then, like goldphish, the suspicious site's domain was compared with the domains returned for the logo query to decide whether it was legitimate; this approach was reported to detect more malicious URLs/domain names than Google Safe Browsing.

Cao et al. (2008) developed the automated individual white-list (AIWL), a system that builds a white-list by registering the IP address of each site with a login interface visited by the user. A user who visits a website is notified of any inconsistency with the registered information of the visited website. One of the AIWL's pitfalls is that it warns the user each time they visit a site with a login interface for the first time.

ML has proven itself a very useful tool in cybersecurity, as in other fields of computer science, and features extensively in the literature for malicious activity detection. Malicious domain name/URL detection is essentially a classification task that separates phishing domain names/URLs from legitimate ones using weights learned from the features and the data set; therefore, generating a good model requires that the data set contain good data and that a set of relevant features be extracted from it. Jain & Gupta (2018) presented an anti-phishing approach that used machine learning to extract 19 client-side features to classify websites. The authors used publicly available phishing pages from PhishTank 4 and OpenPhish 5, along with legitimate pages from Alexa's well-known sites, some top banking websites, online payment gateways, etc. With the use of ML, their proposed approach yielded a 99.39% true positive rate. Zhang et al. (2007) developed CANTINA, a text-based phishing detection technique that extracted a feature set from various fields of a web page using the term frequency-inverse document frequency (TF-IDF) algorithm (Zhang et al., 2007). Next, the top five terms with the highest TF-IDF values were queried using the Google Search engine. If the website was included in the top "n" results, it was classified as legitimate. However, CANTINA was only sensitive to English, which meant its performance was affected on pages in other languages. Another study (2017) proposed a model that separated URLs into n-grams and demonstrated the effectiveness of this method using Shannon's entropy. The authors made a positive contribution by proving that character distributions in the URLs were skewed due to the confusion techniques used by attackers. Thus, we borrowed the idea of using Shannon's entropy as a feature from this paper. F. Tajaddodianfar et al.
(2020) proposed Texception, a character- and word-level embedding model for detecting phishing websites. They used fastText (Bojanowski et al., 2017) to generate embeddings from words found in URLs and built an alphabet of all characters seen at training time in order to generate low-dimensional vectors (embeddings) for the characters. Each of these embeddings was passed to multiple convolutional layers in parallel, each with a different filter size, whose outputs were then concatenated for classification. They trained and tested their approach on data collected from Microsoft's anonymized browsing telemetry. The training set consisted of 1.7M samples, with 20% sampled for validation, and the testing set comprised 20M samples. By randomly initializing word embeddings for the fastText model, the authors reported a 0.28% error rate and a 0.9943 Area Under the Receiver Operating Characteristic Curve.

In a real-time environment, the detection of an attack should be effective and instantaneous. LB approaches are fast; however, they are limited because updating those lists in a timely manner is a difficult task and often requires more system resources. Therefore, they are not able to detect zero-hour attacks. In contrast, while VSB approaches may achieve high accuracy, their weaknesses include the following: they are very complex in nature, they have high processing costs because of the necessity of storing prior knowledge about the websites or large databases of images, and they fail to detect zero-hour attacks, Jain & Gupta (2016). Lastly, MLB techniques tend to require computational resources and time for training, as well as frequent updates to the feature set when attackers start bypassing the current features. Nevertheless, their great advantage is that using new sets of classification algorithms and features can improve accuracy, making the scheme adaptable, Varshney et al. (2016).

During the COVID-19 pandemic, attackers have employed various approaches to steal information from users and avoid being detected by the security systems in place. One technique consists of registering malicious domain names related to COVID-19. The literature offers many approaches designed to detect malicious activities, some of which were described in the previous section. Most of them aim at detecting attacks from entire URLs. Although those methods can be effective, they are limited in the sense that knowledge of the URL is only obtained from the attacker through the attack itself (usually by examining the phishing message). Detection of the malicious site prior to the attack is not possible if the URL is required. Moreover, since several malicious URLs can be generated from a single domain name, when a URL is flagged as malicious, an attacker can simply alter the combination and generate another URL with the same domain name. To avoid these limitations, we tailored our approach to look only at the domain name. Domain names are a combination of:

• the Second-Level Domain (SLD) name, which is the "name" registered by end users, and
• the Sub-Domain(s), which can be added at will by SLD owners.

We limited our focus to domain names because once a domain name has been flagged as malicious, all the URL combinations from that domain name will be considered malicious, since an SLD can only be set once, upon its creation.
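The paper does not name the tool it used to separate these components; as a concrete illustration only, the short sketch below splits a made-up domain name into subdomain, SLD, and suffix with the tldextract library (one common choice, not necessarily the authors') and counts the subdomain levels, which later serves as a feature.

```python
# Minimal sketch (not the authors' code): splitting a domain name into its
# parts with the tldextract library, which relies on the Public Suffix List.
# The example domain below is made up for illustration.
import tldextract

def split_domain(domain: str):
    """Return (subdomain, SLD, suffix, number of subdomain levels)."""
    parts = tldextract.extract(domain)
    # Subdomain levels = number of dot-separated labels left of the SLD.
    levels = len(parts.subdomain.split(".")) if parts.subdomain else 0
    return parts.subdomain, parts.domain, parts.suffix, levels

print(split_domain("secure-login.covid19-relief-funds.tk"))
# -> ('secure-login', 'covid19-relief-funds', 'tk', 1)
```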
In seeking to increase the effectiveness of an attack and obtain more victims, attackers mainly use a combination of techniques, such as random characters, typo-squatting, etc., in the domain name to trick users into believing they are browsing a legitimate website. Because we only used publicly available information to detect malicious domain names, data protection regulations prevented us from accessing useful registration information that could have served our purpose. Therefore, it was difficult for us to find a high-quality, widely accepted data set consisting entirely of verified malicious and legitimate domain names, or at least featuring few false positives and false negatives. Thus, an important challenge associated with using solely domain names is the amount of available data. However, developing an efficient approach that focuses solely on domain names is a strength of the proposed approach because it can be efficiently deployed in a DNS resolver, such as IBM Quad9 8, and thus protect users there without requiring access to their browsing information. The next section describes the steps taken to build the data set and extract the features.

In computer science, the quality of the output is determined by the quality of the input, as stated by George Fuechsel in the concept "Garbage in, Garbage out." Therefore, not only was a good data set needed for the implementation and comparison of the models, but a set of useful features also had to be extracted from the data set. We constructed our own labeled data set containing two classes of domain names: legitimate and malicious. The malicious domain names came mainly from DomainTools 9 and PhishLabs 10. DomainTools provides a publicly available list of COVID-19-related domain names, which is updated daily. On their list, each domain name is assigned a risk score of 70 or higher by DomainTools' risk-scoring mechanisms. For this research, we only kept domain names with a risk score greater than or equal to 90 because these domain names were not manually checked; the higher the risk score, the more likely the domain names were to be confirmed as malicious. PhishLabs, on the other hand, provided intelligence on COVID-19-related attacks from March to September 2020, and the information collected on malicious domain names is available on their website. Though we aimed to utilize a publicly available, widely accepted list of confirmed legitimate domain names, we were not able to find any. Therefore, we compiled our own list with the use of several search engines' APIs. We used Google Custom Search 11, Bing Search 12, Yandex Search, and the MagicBaidu Search API 13. First, a specific keyword list was built with the COVID-19-related keywords shown in Table 1. Then, the keywords were sent to the search APIs to obtain the first hundred results, which had a very low chance of being malicious pages. This phenomenon originates from the fact that web crawlers do not assign a high rank to malicious domain names because of their short lifetime, Le Page (2019).
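For illustration only, the hedged sketch below shows how such a collection step could look for one keyword using the Google Custom Search JSON API; the API_KEY and ENGINE_ID placeholders, the keyword, and the page count are assumptions rather than the authors' actual harvesting code, and the other search engines listed above would be queried analogously.

```python
# Hypothetical sketch of collecting candidate legitimate domains for one
# keyword via the Google Custom Search JSON API (not the authors' code).
import requests
import tldextract

API_KEY = "YOUR_API_KEY"      # assumption: a valid Custom Search API key
ENGINE_ID = "YOUR_ENGINE_ID"  # assumption: a programmable search engine ID

def collect_domains(keyword: str, pages: int = 10) -> set:
    domains = set()
    for start in range(1, pages * 10, 10):   # the API returns at most 10 results per call
        resp = requests.get(
            "https://www.googleapis.com/customsearch/v1",
            params={"key": API_KEY, "cx": ENGINE_ID, "q": keyword,
                    "num": 10, "start": start},
            timeout=30,
        )
        for item in resp.json().get("items", []):
            parts = tldextract.extract(item["link"])
            domains.add(".".join(p for p in (parts.subdomain, parts.domain, parts.suffix) if p))
    return domains

# e.g., collect_domains("covid19 vaccine") -> roughly the first hundred result domains
```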
Therefore, this collection process limits the size of the legitimate class, making the number of legitimate domains retrieved much smaller than the number of malicious ones.

9 https://www.domaintools.com/resources/blog/
10 https://www.phishlabs.com/covid-19-threat-intelligence/
11 https://developers.google.com/custom-search
12 https://www.microsoft.com/en-us/bing/apis/bing-web-search-api
13 https://github.com/1049451037/MagicBaidu

After building the data set, extracting a set of relevant features was the next critical step. Korkmaz et al. (2020) and Buber et al. (2017) list and identify several features commonly used for phishing website detection. After a long discussion with our domain expert, we extracted a set of lexical features from the domain names in the data set. Initially, 12 features, divided into 9 groups, of which 7 can be found in the articles listed above, were extracted, as shown in Table 3. The Freenom TLD feature checks whether the top-level domain belongs to Freenom 17: ".gq", ".cf", ".ml", ".tk" and ".ga". Most of these free domain names are used for malicious intent. With respect to the "entropy" feature, we calculated three entropy values:

1. The entropy of the domain name with the subdomain and suffix.
2. The entropy of the domain name with the subdomain but without the suffix.
3. The entropy of only the domain name, i.e., excluding both the subdomain and suffix.

With the exception of the Tranco rank, other web-based features requiring domain lookups were not used in this project because fetching them would have required excessive time; the process might take days for a huge data set.

We applied two architectures to our data set: λ, also referred to as batch learning, where the model was treated as a static object, i.e., trained on the available data set, used to make predictions on new data, and then retrained from scratch in order to learn from the new data; and κ, also referred to as online learning, where the model learned as instances arrived. Online learning differed from batch learning in that the online model, in addition to making predictions for new data, was also able to learn from it. In the κ setting, we did not have a fixed data set but a continuous, temporally ordered data stream.

Consider a real-valued random variable r whose range is R (e.g., for a probability, the range is one, and for an information gain, the range is log c, where c is the number of classes). Suppose n independent observations of this variable have been made and their mean is r̄. The Hoeffding bound (HB) states that, with probability 1 − δ, the true mean of the variable is at least r̄ − ϵ, where:

ϵ = √(R² ln(1/δ) / (2n))    (2)

Equation (2) shows that the HB is independent of the probability distribution generating the observations. Our choice of Hoeffding Trees (HTs) was motivated by the fact that not only did they quickly achieve high accuracy with a small sample, but they were also incremental and did not perform multiple scans of the same data, thereby resolving the time and computational resource limitations of MLB approaches. We used the Hoeffding Tree classifier developed in scikit-multiflow. The major drawback of the HT is that it cannot handle changes in the data distribution, i.e., concept drift. The Decision Tree Classifier (DTC) is a powerful algorithm used for supervised learning, as it aids in selecting appropriate features for splitting the tree into subparts and finally identifying the target item. At the bottom of the tree, each leaf is assigned to a class.
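Returning briefly to the feature extraction step described above, the following is a minimal sketch of how these lexical features could be computed; the helper names and the extra length/digit-count features are illustrative assumptions rather than the authors' implementation, and the discussion of the individual classifiers continues after the sketch.

```python
# Illustrative sketch (not the authors' code) of the lexical features
# described above: three Shannon-entropy variants, the Freenom TLD flag,
# and the number of subdomain levels.
import math
from collections import Counter
import tldextract

FREENOM_TLDS = {"gq", "cf", "ml", "tk", "ga"}

def shannon_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of a string."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def lexical_features(domain: str) -> dict:
    parts = tldextract.extract(domain)
    with_sub = ".".join(p for p in (parts.subdomain, parts.domain) if p)
    return {
        # 1. entropy of the name with subdomain and suffix
        "entropy_full": shannon_entropy(domain),
        # 2. entropy with the subdomain but without the suffix
        "entropy_no_suffix": shannon_entropy(with_sub),
        # 3. entropy of the SLD only
        "entropy_sld": shannon_entropy(parts.domain),
        "freenom_tld": int(parts.suffix in FREENOM_TLDS),
        "subdomain_levels": len(parts.subdomain.split(".")) if parts.subdomain else 0,
        "length": len(domain),                        # illustrative extra feature
        "digit_count": sum(ch.isdigit() for ch in domain),  # illustrative extra feature
    }

print(lexical_features("covid19-test.vaccine-updates.ga"))
```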
The DTC can be used for classification as well as regression. The attributes of the samples are assigned to the nodes, and each branch's value corresponds to those attributes, Flach (2012).

Random Forest Classifiers (RFCs) are ensemble algorithms developed by Breiman (2001). RFCs attain high accuracy, are able to handle outliers and noise in the data, and are powerful in the sense that the decision comes from the aggregate of all the decisions of the trees in the forest. Thus, they are less prone to overfitting, and their accuracy is relatively high. The weaknesses of RFCs include the following: when used for regression tasks, they have difficulty predicting beyond the range of the training data, and they may overfit data sets that are particularly noisy, Benyamin (2012).

The Gradient Boosting Machine (GBM) (Friedman, 2001, 2002) is another ensemble method that uses decision trees as weak learners. The main objective is to minimize the loss function, i.e., the difference between the actual class value of the training example and the predicted class value, using first-order derivatives. XGBoost (Chen & Guestrin, 2016) is another boosting method that builds decision trees sequentially. It is a refined, customized version of a gradient boosting decision tree system that also utilizes second derivatives and was created for performance, speed, and pushing the limit of what is possible for gradient boosting algorithms. Among the algorithm's features, the regularized boosting that prevents overfitting, the ability to handle missing values automatically, and the tree pruning, which results in optimized trees, make the algorithm very powerful. The main weakness of GBM and XGBoost is that they are more likely to overfit the data because of the complex decision boundaries they can build. Nevertheless, they often construct highly accurate models.

The Support Vector Machine (SVM) (Vapnik, 1998), an important algorithm in machine learning, is a geometric method that separates the classes with a maximum-margin hyperplane. For the Multi-Layer Perceptron (MLP), the error propagated backward through the network yields the local gradient in (5); then, using gradient descent, the change in weight Δw_ji(n) is defined by (6),

Δw_ji(n) = η δ_j(n) y_i(n)    (6)

where δ_j(n) is the local gradient from (5), y_i is the output of the previous neuron, and η is the learning rate, selected to ensure a quick convergence of the weights to a desired response without oscillations. Neural networks are able to learn the function that maps the input to the output without it being explicitly provided and can also handle noisy data and missing values. However, their key disadvantage is that the answer that emerges from a neural network's weights is difficult to understand, i.e., it is a black box. In addition, training often takes longer than it does for their competitors.

In the λ architecture, we conducted a 10-fold cross-validation to assess the overall performance of the algorithms on the data set. For the online learning approach (κ architecture), in contrast, we trained the HT classifier with streams of different sizes and observed the performance. In total, we generated 6 streams (ST1-ST6), pairing different levels of class distribution with feature sets that either included or excluded the "Subdomain levels" feature.

Frequently, the performance of a model is evaluated by constructing a confusion matrix. From the values of the confusion matrix, metrics such as the accuracy, the F-1 score, the specificity (True Negative Rate, the recall of the negative class), the sensitivity (True Positive Rate, the recall of the positive class), precision, and so forth can be calculated to evaluate the efficiency of the algorithms.
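As a concrete companion to this description (the formal definitions follow in (7)-(11)), the small helper below derives these metrics from raw confusion-matrix counts, using the paper's convention that the positive class is "legitimate"; it is an illustrative sketch, not the authors' evaluation code, and the example counts are made up.

```python
# Illustrative helper (not the authors' code): metrics derived from a
# confusion matrix. Following the paper's convention, the positive class
# is "legitimate" and the negative class is "malicious".
import math

def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    sensitivity = tp / (tp + fn)            # TPR, recall of the positive class
    specificity = tn / (tn + fp)            # TNR, recall of the negative class
    precision   = tp / (tp + fp)
    f1          = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    g_mean      = math.sqrt(sensitivity * specificity)
    return {"TPR": sensitivity, "TNR": specificity, "precision": precision,
            "F1": f1, "accuracy": accuracy, "G-mean": g_mean}

# Example with made-up counts:
print(confusion_metrics(tp=90, tn=820, fp=30, fn=10))
```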
Equations for calculating the metrics are shown in (7)-(11), where TP represents the true positives, the domain names predicted as legitimate that were truly legitimate; TN the true negatives, the domain names predicted as malicious that were truly malicious; FP the false positives, the domain names predicted as legitimate that were in fact malicious; and FN the false negatives, the domain names predicted as malicious that were in fact legitimate, as produced by the algorithms.

Sensitivity (TPR) = TP / (TP + FN)    (7)
Specificity (TNR) = TN / (TN + FP)    (8)
Precision = TP / (TP + FP)    (9)
F-1 = 2 × Precision × Sensitivity / (Precision + Sensitivity)    (10)
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (11)

As mentioned in the previous sections, our data set was highly imbalanced; because the malicious class contained about 9 times more domain names than the minority class, we needed to resample the data set to change the class distribution and make it more balanced. Several sampling techniques have been proposed: some that add more samples to the minority class (oversampling), others that remove samples from the majority class (undersampling), and yet others that employ both oversampling and undersampling. Each of these approaches has associated advantages as well as disadvantages. Oversampling, by increasing the number of samples in the minority class, increases the chances of overfitting along with the learning time, as it makes the data set larger. In contrast, the main disadvantage of undersampling is that it removes data that can be useful. As we were in a time-sensitive environment and oversampling would increase the learning time, we defaulted to undersampling. Several undersampling methods have been proposed, including methods that select examples to keep (e.g., Near-Miss (NM) (Zhang & Mani, 2003), Condensed Nearest Neighbor (CNN) (Hart, 1968)), others that select which examples to delete (e.g., Tomek Links (TL) undersampling (Tomek, 1976), Edited Nearest Neighbors (ENN) (Wilson, 1972)), and still others that do both (e.g., Fernández (2018), Laurikkala (2001)). As reported in Table 4, we set the number-of-neighbors parameter for NM, CNN and ENN to 3; as the CNN, ENN and TL methods did not allow us to explicitly set the ratio of samples to retain in the majority class, only the NM was set to balance the data set with a 20:80 ratio. As the table shows, CNN took more time to resample the data set, removed too many samples from the majority class, and yielded poor results compared to the others. In contrast, ENN achieved the best results, but it took slightly longer than the NM to resample and did not achieve our desired ratio. We therefore defaulted to the NM, as it reached our sampling goal in a short amount of time while still yielding good results. Figure 3 shows the approximate data distribution before and after undersampling the majority (malicious) class using NM.

In the λ architecture, we tested our models with and without the "Subdomain levels" feature, as most of the legitimate domain names had at least one subdomain. Moreover, we observed that the feature had more importance than all the others after training and testing. Table 5 shows the average fit time, F-1, precision, accuracy, TPR and TNR of the different classifiers after 10-fold cross-validation without the feature, while Table 6 shows those scores with the feature.
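Before turning to the results in Tables 5 and 6, the undersampling step above can be made concrete; the paper does not state which NearMiss implementation was used, so the sketch below assumes the imbalanced-learn library and a placeholder feature matrix, reproducing the 3-neighbor setting and the 20:80 target ratio.

```python
# Hypothetical sketch of the undersampling step using imbalanced-learn's
# NearMiss (the paper does not state which implementation was used).
# X: 2-D array of lexical features, y: labels (1 = malicious, 0 = legitimate).
import numpy as np
from imblearn.under_sampling import NearMiss

rng = np.random.default_rng(42)
X = rng.random((10_000, 12))                           # placeholder feature matrix
y = np.r_[np.ones(9_000, int), np.zeros(1_000, int)]   # ~9:1 imbalance, as in the paper

# sampling_strategy=0.25 asks for a 20:80 minority-to-majority split
# (1,000 legitimate vs. 4,000 malicious after resampling).
nm = NearMiss(sampling_strategy=0.25, version=1, n_neighbors=3)
X_res, y_res = nm.fit_resample(X, y)
print(np.bincount(y_res))   # -> [1000 4000]
```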
One can observe from Tables 5 & 6 that, overall, the DTC, RFC, and MLP yielded better scores when the feature was removed, and the situation was reversed for the XGBoost, GBM, and SVM. This outcome shows that the "Subdomain Levels" feature can have either a positive or a negative impact, depending on the model. Therefore, to avoid over-fitting, we have focused more on the scores without the feature in this study, i.e., Table 5. The null hypothesis is rejected in those cases (shown in bold in Table 7).

The Hoeffding Tree (HT) classifier was used in the κ architecture for the classification task. As with the λ architecture, we tested the model with and without the "Subdomain levels" feature. In real data streams, the number of examples for each class might be changing and evolving. Therefore, the accuracy score is only useful when all classes have the same number of instances. κ is a more suitable measure for evaluating how streaming classifiers perform. The κ statistic was introduced by Cohen (Cohen, 1960) and is defined by:

κ = (p0 − pc) / (1 − pc)    (12)

The quantity p0 is the classifier's prequential accuracy, and pc is the probability of randomly guessing a correct prediction. If the classifier is always correct, then κ = 1. If the classifier's predictions are similar to random guessing, then κ = 0. Accordingly, we observed the value of κ for every test in addition to the accuracy, precision, recall, and Geometric Mean (G-Mean), which measures how balanced the classification performances were on both the majority and minority classes and is defined by (13):

G-Mean = √(Sensitivity × Specificity)    (13)

The "Subdomain levels" feature highly influenced the results of models trained with streams containing the same level of data distribution, i.e., ST1-ST2, ST3-ST4, and ST5-ST6. This outcome shows that even when instances were learned one at a time, the feature played a critical role in correctly predicting labels.

In the batch learning method (λ), the XGBoost algorithm outperformed the other five models (DTC, RFC, GBM, SVM and MLP) because it achieved relatively high and consistent scores in a short amount of time on the data set after 10-fold cross-validation, unlike the other models, which seemed to focus on improving one metric at the expense of the others. Regarding the online learning method (κ), after testing the HT classifier with streams of different levels of data distribution and feature sets, we were able to conclude that the stream ST5 was the best, as the model achieved the highest mean recall, F-1, and precision compared to the other data streams. Comparing XGBoost to the HTs without the "subdomain level" feature reveals that XGBoost performed better than the HTs on streams ST2, ST4, and ST6, although ST2 yielded better accuracy than XGBoost. This outcome shows that the HT's accuracy kept improving with each sample it trained on at the expense of the other metrics. When the "subdomain level" feature was kept, however, ST3 and ST5 outperformed XGBoost in terms of F-1 and precision, and ST1 and ST3 outperformed XGBoost in terms of accuracy. Conversely, XGBoost outperformed the rest (ST1, ST3, and ST5) in terms of mean recall. This result indicates that the data distribution had an impact on the HT, in addition to the "subdomain level" feature, which also impacted XGBoost's scores. The results obtained show that we can effectively detect malicious domain names associated with COVID-19.
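To make the κ-architecture evaluation concrete, here is a minimal test-then-train (prequential) sketch with scikit-multiflow's Hoeffding Tree that reports prequential accuracy and Cohen's κ; the synthetic feature matrix and labels are placeholders, not the paper's streams ST1-ST6.

```python
# Minimal prequential (test-then-train) sketch with a Hoeffding Tree,
# assuming scikit-multiflow is installed. X and y are placeholders for
# the lexical feature matrix and labels; they are NOT the paper's streams.
import numpy as np
from skmultiflow.trees import HoeffdingTreeClassifier
from sklearn.metrics import cohen_kappa_score, accuracy_score

rng = np.random.default_rng(0)
X = rng.random((5_000, 11))                                  # e.g., features without "Subdomain levels"
y = (X[:, 0] + 0.2 * rng.random(5_000) > 0.6).astype(int)    # synthetic labels

ht = HoeffdingTreeClassifier()
y_true, y_pred = [], []
for i in range(len(X)):
    xi, yi = X[i].reshape(1, -1), np.array([y[i]])
    if i > 0:                               # test first (skip the very first sample)
        y_pred.append(int(ht.predict(xi)[0]))
        y_true.append(int(yi[0]))
    ht.partial_fit(xi, yi, classes=[0, 1])  # then train on the same sample

print("prequential accuracy:", round(accuracy_score(y_true, y_pred), 3))
print("kappa:", round(cohen_kappa_score(y_true, y_pred), 3))
```

In the paper's experiments, analogous runs over ST1-ST6, with and without the "Subdomain levels" feature, produced the stream comparisons discussed above.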
Being able to detect these domain names effectively is a significant outcome given the overwhelming panic caused by the pandemic, which left many individuals vulnerable. In this paper, we applied two learning methods, batch learning and online learning, to classify COVID-19-related domain names in our data set. The data set and extracted features allowed us to build a fast-performing detection scheme for malicious domain names related to COVID-19. Indeed, our feature set consisted mainly of lexical features, easily and quickly extracted from the data we already had, and did not require domain lookups, which would slow down the feature extraction step; moreover, the algorithms we used attained relatively high scores. In addition, we have shown how one feature, namely the number of subdomain levels, can highly influence the performance of the different classifiers in both the batch and online learning frameworks.

In the future, we plan on investigating effective feature sets for detecting domain names related to other attack campaigns, as misspelled domain names can be generated from other keywords, not only those related to COVID-19. An additional area of interest would be looking at other parts of the URL, such as paths, query strings and fragments, for detection. In the batch learning approach, future work will involve tuning the algorithms for the best hyperparameters and selecting the model suiting our needs to make predictions on unseen data. Those predictions will then first be compared to those made by the Hoeffding Trees pre-trained with different levels of data distribution, using statistics such as the inter-rater reliability, κ, to see how the classifiers agree in the predictions made for each instance. Second, those predictions will be compared with labels from VirusTotal, a source that we consider reliable for this purpose, to properly evaluate each model. Finally, the best model can then be deployed as a browser add-on or plug-in to warn users that they are about to visit a malicious domain and flag those domain names in real time.

References
VisualPhishNet: Zero-Day Phishing Website Detection by Visual Similarity
Phishing dynamic evolving neural fuzzy framework for online detection "zero-day" phishing email
A comprehensive survey of AI-enabled phishing attacks detection techniques
A gentle introduction to random forests, ensembles, and performance metrics in a commercial system
Enriching word vectors with subword information
Random forests
Feature selections for the machine learning based detection of phishing websites
Anti-phishing based on automated individual white-list
XGBoost: A scalable tree boosting system
Utilisation of website logo for phishing detection
A coefficient of agreement for nominal scales
Mining high-speed data streams. Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '00
GoldPhish: Using images for content-based phishing analysis
Texception: A Character/Word-Level Deep Learning Model for Phishing URL Detection
Learning from Imbalanced Data Sets
Machine learning: the art and science of algorithms that make sense of data
Stochastic gradient boosting
Greedy function approximation: A gradient boosting machine
Using automated individual white-list to protect web digital identities
PREDATOR: Proactive recognition and elimination of domain abuse at time-of-registration
The condensed nearest neighbor rule (corresp.)
Probability inequalities for sums of bounded random variables
A novel approach to protect against phishing attacks at client side using auto-updated white-list
Phishing detection: Analysis of visual similarity based approaches. Security and Communication Networks
Towards detection of phishing websites on client-side using machine learning based approach
Feature Selections for the Classification of Webpages to Detect Phishing Attacks: A Survey
Improving identification of difficult small classes by balancing class distribution
Understanding the Phishing Ecosystem. Thesis, Université d'Ottawa / University of Ottawa
Tranco: A research-oriented top sites ranking hardened against manipulation
Hoeffding races: Accelerating model selection search for classification and function approximation
Intelligent rule-based phishing websites classification
A layout-similarity-based approach for detecting phishing pages
Learning Internal Representations by Error Propagation
Machine learning based phishing detection from URLs. Expert Systems with Applications
Secrets and lies: digital security in a networked world
SeedsMiner: Accurate URL Blacklist-Generation Based on Efficient OSINT Seed Collection
Two modifications of CNN
Frank Rosenblatt: Principles of neurodynamics: Perceptrons and the theory of brain mechanisms. Brain Theory
Statistical learning theory. Adaptive and learning systems for signal processing, communications, and control
A survey and classification of web phishing detection schemes. Security and Communication Networks
What's in a URL: Fast feature extraction and malicious URL detection
Asymptotic properties of nearest neighbor rules using edited data
CANTINA+: A feature-rich machine learning framework for detecting phishing web sites
KNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction
Cantina: a content-based approach to detecting phishing web sites
DTOF-ANN: An Artificial Neural Network phishing detection model based on Decision Tree and Optimal Features

We thank the anonymous reviewers and the editor for their constructive comments and suggestions. We are also grateful to VirusTotal for access to their service and to ZETAlytics, DomainTools, and PhishLabs for access to their data sets. This research was supported by the Natural Sciences and Engineering Research Council of Canada, the Vector Institute, and The IBM Center for

We wish to confirm that there are no known conflicts of interest associated with this publication and that there has been no significant financial support for this work that could have influenced its outcome. We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us.

We confirm that we have given due consideration to the protection of intellectual property associated with this work and that there are no impediments to publication, including the timing of publication, with respect to intellectual property. In so doing we confirm that we have followed the regulations of our institutions concerning intellectual property.

We understand that the Corresponding Author is the sole contact for the Editorial process (including Editorial Manager and direct communications with the office).
He is responsible for communicating with the other authors about progress, submissions of revisions, and final approval of proofs. We confirm that we have provided a current, correct email address which is accessible by the Corresponding Author.

Signed by all authors as follows: