key: cord-0261920-t10hlaoi authors: Pritom, Mir Mehedi Ahsan; Schweitzer, Kristin M.; Bateman, Raymond M.; Xu, Min; Xu, Shouhuai title: Data-Driven Characterization and Detection of COVID-19 Themed Malicious Websites date: 2021-02-25 journal: nan DOI: 10.1109/isi49825.2020.9280522 sha: fe5b48de023c4b6c6b144a0443531f23a602d720 doc_id: 261920 cord_uid: t10hlaoi COVID-19 has hit the global community hard, and organizations are working diligently to cope with the new norm of "work from home". However, the volume of remote work is unprecedented and creates opportunities for cyber attackers to penetrate home computers. Attackers have been leveraging websites with COVID-19 related names, dubbed COVID-19 themed malicious websites. These websites mostly contain false information, fake forms, fraudulent payments, scams, or malicious payloads to steal sensitive information or infect victims' computers. In this paper, we present a data-driven study on characterizing and detecting COVID-19 themed malicious websites. Our characterization study shows that attackers are agile and deceptively crafty in designing geolocation-targeted websites, often leveraging popular domain registrars and top-level domains. Our detection study shows that the Random Forest classifier can detect COVID-19 themed malicious websites based on the lexical and WHOIS features defined in this paper, achieving a 98% accuracy and a 2.7% false-positive rate. The COVID-19 pandemic has given rise to many new cyber attack vectors. Many of these cyber attacks incorporate COVID-19 themed factors into phishing, malware, and scamming schemes for various malicious goals (e.g., monetary benefits, stealing credentials, stealing credit card numbers, or identity theft). For example, there was reportedly a 148% increase in ransomware attacks in March 2020 compared with February 2020 [1], where many attacks are initiated by malicious websites abusing victims' trust.
This paper focuses on one emerging attack vector, namely malicious websites leveraging COVID-19 as a theme, or COVID-19 themed malicious websites [2]. As organizations adopt the "work from home" policy, the consequences of COVID-19 themed malicious websites can be significantly amplified because home computers are often more vulnerable to attack than work computers. During the COVID-19 pandemic, many people have lost their jobs and are affected by mental health issues, which creates excessive pressure. This pressure may make average users even more vulnerable to social engineering attacks waged via COVID-19 themed malicious websites. This underscores the importance of understanding and defending against COVID-19 themed malicious websites, which is a new problem that has not been studied before in a systematic way. Our contributions. In this paper, we make the following contributions. First, we propose a methodology for characterizing and detecting COVID-19 themed malicious websites through a data-driven approach. To the best of our knowledge, this is the first study on data-driven characterization and detection of COVID-19 themed malicious websites.
Second, we apply the methodology to specific datasets to draw the following insights: (i) some attackers may be incentivized to use cheaper registrars for registering COVID-19 themed malicious websites; (ii) attackers often abuse popular top-level domains for their COVID-19 themed malicious websites; (iii) attackers are agile in waging the COVID-19 themed malicious website attack; (iv) attackers are crafty in using COVID-19 themed keywords and geographical information in creating COVID-19 themed malicious website domain names; (v) the small degree of data imbalance does not have any significant impact on the effectiveness of detecting COVID-19 themed malicious websites; and (vi) COVID-19 themed malicious website detectors must consider WHOIS features, and Random Forest performs better than K-nearest neighbor, decision tree, logistic regression, and support vector machine. Paper outline. The rest of the paper is organized as follows. Section II explores the related work. Section III presents the research questions that guide us in characterizing and detecting COVID-19 themed malicious websites. Section IV reports the experiments and results. Section V discusses the limitations of this study and future research opportunities. Section VI concludes the paper. Although the problem of COVID-19 themed malicious websites has not been investigated until now, the problem of malicious websites was studied in the literature prior to the COVID-19 pandemic. The problem of detecting malicious URLs generated by domain generating algorithms has been investigated in [3]. The problem of detecting phishing websites has been addressed via various approaches, including: the descriptive features-based model [4], the lexical and HTML features-based model [5], the HTML and URL features-based model [6], and the natural language processing and word vector features-based model [7].
The problem of detecting malicious websites has been addressed via the following approaches: leveraging application and network layers information [8], leveraging image recognition [9], leveraging generic URL features [10], [11], leveraging character-level embedding or keyword-based recurrent neural networks [12]-[14], and the notion of adversarial malicious website detection [15]. However, these studies do not consider features pertinent to the COVID-19 pandemic, which we leverage. Nevertheless, the present study falls under the umbrella of cybersecurity data analytics [16]-[20], which in turn belongs to the Cybersecurity Dynamics framework [21]-[25]. Our methodology for data-driven characterization and detection of COVID-19 themed malicious websites is centered on answering a range of research questions. In order to characterize COVID-19 themed malicious websites, we address four Research Questions (RQs), RQ1-RQ4, which concern the registrars, top-level domains, temporal trends, and domain-name keywords associated with these websites. Answering the preceding questions will deepen our understanding of COVID-19 themed malicious website attacks. We propose leveraging machine learning to detect COVID-19 themed malicious websites and answer:
• RQ5: Which classifier is competent in detecting COVID-19 themed malicious websites?
• RQ6: What is the impact of WHOIS features on the classifier's effectiveness?
In order to answer these questions, we need to train detectors. Figure 1 highlights the methodology for detecting COVID-19 themed malicious websites. The methodology can be decomposed into the following modules: data collection, feature definition and extraction, data pre-processing, classifier training, and classifier test. Data about websites need to be collected from reliable sources. The collected data may need enrichment to provide more information, as illustrated in our case study. Then, features may be defined to describe these websites. In the case of using deep learning (which requires much larger datasets), features may be automatically learned.
One may consider a range of classifiers, which are generically called C_i's in Figure 1. As shown in Figure 1, one can use classifiers individually or an ensemble of them (e.g., via a desired voting scheme, such as weighted vs. unweighted majority voting). In the simple form of unweighted majority voting, a website is classified as malicious if a majority of the classifiers predict it as malicious; otherwise, it is classified as benign. In order to evaluate the effectiveness of the trained classifiers, we propose adopting the standard metrics, including: accuracy (ACC), false-positive rate (FPR), false-negative rate (FNR), and F1-score. Specifically, let TP be the number of true positives, TN be the number of true negatives, FP be the number of false positives, and FN be the number of false negatives. Then, we have
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{FPR} = \frac{FP}{FP + TN}, \quad \mathrm{FNR} = \frac{FN}{FN + TP}, \quad F_1 = \frac{2\,TP}{2\,TP + FP + FN}.$$
Our case study applies the methodology to specific datasets. Our dataset of COVID-19 themed malicious website examples is obtained from what was published between 2/1/2020 and 5/15/2020 by two sources: (i) CheckPhish [26], which contains 131,761 malicious websites waging scamming attacks related to COVID-19; and (ii) DomainTools [27], which contains 157,579 malicious websites waging malware, phishing, and spamming attacks related to COVID-19. The union of these two sets leads to a total of 221,921 malicious websites, denoted by D_malicious, owing to the fact that 67,419 websites belong to both sets (i.e., 131,761 + 157,579 - 67,419 = 221,921). For obtaining benign websites, we use the top 250,000 websites from Cisco's Umbrella 1 million websites dataset [28] on 05/16/2020, denoted by D_benign, which is a source of reputable websites. We compile a merged dataset denoted by D_initial = D_malicious ∪ D_benign. In order to collect the WHOIS information of a website, we use the python library whois 0.9.7 to query the WHOIS database on 8/7/2020.
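The evaluation metrics and the unweighted majority voting rule described above can be illustrated with a minimal sketch. It assumes the usual textbook definitions of ACC, FPR, FNR, and F1-score; the function names `evaluation_metrics` and `majority_vote`, and the encoding of malicious as 1 and benign as 0, are hypothetical choices made for illustration only.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Compute ACC, FPR, FNR and F1 from confusion-matrix counts.

    Assumes the standard definitions; a ratio with an empty
    denominator is reported as 0.0 to avoid division by zero.
    """
    total = tp + tn + fp + fn
    f1_denominator = 2 * tp + fp + fn
    return {
        "ACC": (tp + tn) / total if total else 0.0,
        "FPR": fp / (fp + tn) if (fp + tn) else 0.0,
        "FNR": fn / (fn + tp) if (fn + tp) else 0.0,
        "F1": 2 * tp / f1_denominator if f1_denominator else 0.0,
    }


def majority_vote(predictions):
    """Unweighted majority voting over 0/1 predictions for one website.

    Returns 1 (malicious) only if strictly more than half of the
    classifiers predict 1; otherwise returns 0 (benign).
    """
    return 1 if sum(predictions) > len(predictions) / 2 else 0


# Illustrative counts only; these are not results from the paper.
example = evaluation_metrics(tp=90, tn=95, fp=5, fn=10)
```

With five classifiers, `majority_vote([1, 1, 0, 1, 0])` labels the website as malicious because three of the five votes are malicious.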
We observe that 42,540 (or 19.17%) out of the 221,921 malicious websites have no WHOIS information available, and 93,082 (or 37.2%) out of the 250,000 benign websites have no WHOIS information available. This means that the presence or absence of WHOIS information alone does not indicate whether a website is malicious.
B. Characterization Case Study
1) Answering RQ1: Identifying the WHOIS registrars that are most abused to launch COVID-19 themed malicious websites: For this purpose, we use a subset of D_malicious, denoted by D'_malicious, which contains 171,901 malicious websites with WHOIS registrar name information available. We observe that GoDaddy is the most frequently abused registrar, followed by Google and Namecheap. This finding inspires us to analyze whether there is any financial incentive behind the use of a specific registrar. The cost of registering a .com domain for the first year is: GoDaddy for $11.99, Google for $9, Namecheap for $8.88, Dynadot for $8.99, 1&1 for $1, name.com for $8.99, PDR Ltd for $35, OVH for $8.28, Alibaba for $7.99, and Reg-ru for $28. This suggests that some attackers might have chosen the registrar 1&1 because it is the cheapest, while other attackers use reputable registrars regardless of price.
Insight 1: Some attackers may be incentivized to use cheaper registrars, but others are not.
2) Answering RQ2: Which Top Level Domains (TLDs) are most abused by COVID-19 themed malicious websites?: In order to answer this question, we use the original dataset D_malicious, which contains 221,921 COVID-19 themed malicious websites with corresponding TLD information. Figure 3 depicts the top 10 abused TLDs, which are ranked according to the absolute number of COVID-19 themed malicious websites under each TLD. We make the following observations. First, .com hosts the highest number of malicious websites, followed by .org and .net. Second, 5 of the top 10 abused TLDs are country-code TLDs (ccTLDs), namely .de, .uk, .ru, .nl, and .eu.
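The TLD ranking described above can be sketched as a simple counting exercise. The sketch below is a simplification: it treats the last dot-separated label of a domain name as its TLD (so `example.co.uk` counts under `.uk`), and the function names `tld_of` and `top_tlds` as well as the sample domains are hypothetical, not taken from the datasets.

```python
from collections import Counter


def tld_of(domain):
    """Return the last dot-separated label of a domain name, e.g. '.com'."""
    cleaned = domain.strip().lower().rstrip(".")
    return "." + cleaned.rsplit(".", 1)[-1]


def top_tlds(domains, k=10):
    """Rank TLDs by the absolute number of domains that use them."""
    return Counter(tld_of(d) for d in domains).most_common(k)


# Hypothetical sample; the real input would be the domains in D_malicious.
sample = ["covid-help.com", "corona-test.org", "maskshop.com", "covid19-info.de"]
ranking = top_tlds(sample, k=3)
```

The same counting pattern would apply to registrar names once they have been extracted from WHOIS records.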
Insight 2: Attackers often abuse popular TLDs.
3) Answering RQ3: What trends are exhibited by COVID-19 themed malicious websites?: In order to answer this question, we use the dataset D_malicious mentioned above. Figure 4 depicts the trend of malicious websites, leading to two observations. First, there is a discrepancy between the daily numbers of websites reported by the two sources. According to CheckPhish, the number of COVID-19 themed malicious websites reaches its peak on 03/25/2020, with 18,495 malicious websites; according to DomainTools, the number of COVID-19 themed malicious websites reaches its peak on 03/20/2020, with 3,981 malicious websites. These data indicate that there are reporting inconsistencies among sources and that many COVID-19 themed malicious websites were created at the early stage of the pandemic, when uncertainty was greatest. Second, the number of COVID-19 themed malicious websites, by and large, has been decreasing since the last week of March 2020 (i.e., two weeks after the pandemic declaration), leading to about 1,000 websites per day during the first week of May 2020 (i.e., about two months after the pandemic declaration). However, there is still oscillation. One possible cause is that attackers have been waiting to create new COVID-19 themed malicious websites based on the pandemic's new developments (e.g., vaccine).
Insight 3: Attackers are agile in waging COVID-19 themed malicious website attacks.
4) Answering RQ4: In order to answer this question, we analyze the dataset D_malicious mentioned above. We use the python library wordninja with the English Wikipedia language model [29] to split domain name strings and extract COVID-19 themed keywords. We observe that 4 keywords (i.e., covid, corona, covid19, and coronavirus) are the most widely used, as expected; they are followed by mask, quarantine, virus, test, facemask, pandemic, and vaccine. In total, we extract more than 19,000 keywords. A further analysis of the domain names reveals that attackers create COVID-19 themed malicious websites with names containing geographical attributes.
For example, coronaviruspreventionsanantonio.com, coronavirusprecentionhouston.com, and coronaviruspreventiondallas.com each combine a city name with a COVID-19 themed keyword. Moreover, we observe the existence of COVID-19 themed "parking" websites, which have no content at the present time but might be used for upcoming COVID-19 themes.
Insight 4: Attackers are crafty in using COVID-19 themed keywords and geographical information in creating COVID-19 themed malicious website domain names.
Given D_initial, the detection case study proceeds as follows.
1) Feature Definition and Extraction: We define features according to the following aspects of websites: WHOIS (F1-F4), domain name lexical information (F5-F9), statistical information (F10), and Top-Level Domain or TLD (F11).
• Current WHOIS registration lifetime (F1): This is the number of days that have passed since a website's registration, with respect to the date when this feature's value is extracted (e.g., 08/07/2020 in our case).
• Remaining WHOIS expiration lifetime (F2): This is the number of remaining days before a website's WHOIS registration expires, with respect to the date when this feature's value is extracted (e.g., 08/07/2020 in our case).
• Number of days since last WHOIS update (F3): This is the number of days elapsed since a website's last WHOIS update, with respect to the date when this feature's value is extracted (e.g., 08/07/2020 in our case).
This is the total number of unique alphabetic and numeric characters (i.e., a-z, A-Z, 0-9) in a domain name.
• Domain entropy (F10): This is the Shannon entropy [30] of the domain name (i.e., a kind of statistical information), which is computed based on the frequency of characters in the domain name.
• TLD Reputation (F11): We propose measuring a TLD's reputation as m / |D_benign|, where m is the number of websites in D_benign that contain this particular TLD.
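The date-based WHOIS features (F1-F3), the domain entropy (F10), and the TLD reputation (F11) defined above can be sketched as follows. This is a minimal illustration under stated assumptions: the entropy uses a base-2 logarithm over all characters of the given string, the TLD match is a simple suffix test, and every function name, argument name, and sample value is hypothetical rather than taken from the authors' implementation.

```python
import math
from collections import Counter
from datetime import date


def whois_lifetime_features(created, expires, updated, extracted_on):
    """Return F1-F3 as day counts relative to the extraction date."""
    return {
        "F1": (extracted_on - created).days,   # days since registration
        "F2": (expires - extracted_on).days,   # days until expiration
        "F3": (extracted_on - updated).days,   # days since last update
    }


def domain_entropy(name):
    """Shannon entropy (in bits, an assumption) of the character frequencies."""
    if not name:
        return 0.0
    counts = Counter(name)
    n = len(name)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def tld_reputation(tld, benign_domains):
    """Fraction m / |D_benign| of benign domains that end with the given TLD."""
    if not benign_domains:
        return 0.0
    m = sum(1 for d in benign_domains if d.lower().endswith(tld.lower()))
    return m / len(benign_domains)


# Hypothetical WHOIS dates, evaluated at the extraction date used in the text.
features = whois_lifetime_features(
    created=date(2020, 3, 25),
    expires=date(2021, 3, 25),
    updated=date(2020, 4, 1),
    extracted_on=date(2020, 8, 7),
)
```

A recently registered domain yields a small F1 value, which is the kind of signal the WHOIS features are meant to expose to a classifier.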
2) Data Pre-Processing: Given that some websites may not have information for all of the features, it is important to consider different scenarios. In our example, we propose considering two datasets that can be derived from D_initial because some websites do not have information for the WHOIS features.
• Dataset D_1 ⊂ D_initial consists of websites for which WHOIS information is available (i.e., features F1-F4 are available).
Since only D_1 contains all WHOIS information, we use it for the feature selection study. For this purpose, we use the random forest classification feature importance method [31] (with the 80-20 splitting of training-test data) to find the important features. Table I depicts the relative importance of the features in D_1. We observe that F6 and F8 have a very small relative importance (i.e., < 0.01) when compared to the others, suggesting that hyphens and digits are used about equally in malicious and benign domain names. Hence, we eliminate F6 and F8 in the remainder of the study on D_1. In order to see whether or not the feature selection result is impacted by the data imbalance of D_1 (with the malicious:benign ratio being 3.1:1), we explore two widely-used methods: (i) oversampling the minority class by replicating some random examples; and (ii) undersampling the majority class by removing some random examples. We first perform the 80-20 splitting of training-test data, and then change the malicious:benign ratio in the training set while keeping the test set intact. We wish to identify the ratio that achieves the highest F1-score. In what follows, we only report the results of Random Forest because it outperforms the other classifiers on the original dataset D_1. Table II shows the impact of the malicious:benign ratio in the training set.
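The oversampling step described above (replicating random minority-class examples in the training set only, until a target majority:minority ratio is reached) can be sketched as follows. This is an illustrative stand-in rather than the authors' code: the function name `oversample_minority`, the fixed random seed, and the toy data are assumptions.

```python
import math
import random


def oversample_minority(samples, labels, target_ratio, seed=0):
    """Replicate random minority examples until majority:minority <= target_ratio.

    Only the training set should be passed in; the test set stays intact.
    """
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    # Sorting the two classes by frequency yields (minority, majority).
    minority, majority = sorted(counts, key=counts.get)
    wanted = math.ceil(counts[majority] / target_ratio)
    needed = max(0, wanted - counts[minority])
    rng = random.Random(seed)
    pool = [i for i, y in enumerate(labels) if y == minority]
    extra = [rng.choice(pool) for _ in range(needed)]
    new_samples = list(samples) + [samples[i] for i in extra]
    new_labels = list(labels) + [labels[i] for i in extra]
    return new_samples, new_labels


# Toy training set with a 3.1:1 malicious (1) to benign (0) ratio.
train_x = [[i] for i in range(410)]
train_y = [1] * 310 + [0] * 100
aug_x, aug_y = oversample_minority(train_x, train_y, target_ratio=1.67)
```

Undersampling would follow the same pattern, except that random majority-class examples are removed instead of minority-class examples being replicated.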
We observe that the oversampling-incurred ratio 1.67:1 leads to the highest F1-score (and the second best FPR and lowest FNR), while undersampling never performs better than the original data ratio in terms of accuracy and F1-score. This can be explained by the fact that undersampling eliminates useful information. This prompts us to use oversampling to achieve the 1.67:1 ratio when training classifiers, which turns D_1 into D'_1 (i.e., the training set is augmented). Figure 5 further highlights the confusion matrices of the experiments on the same test set, corresponding to D_1 and D'_1, which show a slight improvement in detection when augmenting the training set with oversampling.
Insight 5: The data imbalance issue does not affect the model performance significantly in this case, perhaps because the degree of imbalance is not severe enough.
3) Training and Test: Having addressed the issues of feature selection and data imbalance, we consider the following classifiers: Random Forest (RF), Decision Tree (DT), Logistic Regression (LR), K-Nearest Neighbor (KNN), and Support Vector Machine (SVM). Specifically, we use the python sklearn module to import the following classifier algorithms: (i) Random Forest or RF with parameters n_estimators=100 (i.e., 100 trees in a forest) and criterion='entropy' (i.e., entropy is used to measure information gain); (ii) K-Nearest Neighbor or KNN, with parameters n_neighbors=8 (i.e., 8 neighbors are considered), metric='minkowski' with p = 2 (i.e., the Minkowski metric with p = 2 measures the distance between two feature vectors), and the remaining parameters set to their default values; (iii) Decision Tree or DT with default parameters; (iv) Logistic Regression or LR with default parameters; (v) Support Vector Machine or SVM with a linear kernel and other default parameters. For voting on the outputs of the five classifiers mentioned above, we use the VotingClassifier() function and set voting='hard' (i.e., majority voting).
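The classifier configuration described above can be sketched with scikit-learn as follows. The hyperparameters mirror the ones stated in the text; everything else is an assumption made so that the sketch runs on its own: the synthetic data from `make_classification` stands in for the real feature vectors, `random_state` values are added for reproducibility, and `StandardScaler` fitted on the training split is one plausible way to scale the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_classifiers(seed=0):
    """Return the five classifiers with the parameters named in the text."""
    return {
        "RF": RandomForestClassifier(
            n_estimators=100, criterion="entropy", random_state=seed
        ),
        "KNN": KNeighborsClassifier(n_neighbors=8, metric="minkowski", p=2),
        "DT": DecisionTreeClassifier(random_state=seed),
        "LR": LogisticRegression(),
        "SVM": SVC(kernel="linear"),
    }


def build_ensemble(seed=0):
    """Hard (majority) voting over fresh copies of the five classifiers."""
    estimators = list(build_classifiers(seed).items())
    return VotingClassifier(estimators=estimators, voting="hard")


# Synthetic stand-in data (assumption); 9 columns mimic a small feature set.
X, y = make_classification(n_samples=400, n_features=9, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Scale using statistics from the training split only (assumption).
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = build_classifiers()
models["Ensemble"] = build_ensemble()
scores = {
    name: model.fit(X_train_s, y_train).score(X_test_s, y_test)
    for name, model in models.items()
}
```

Here `scores` maps each model name to its test-set accuracy on the synthetic data, which says nothing about performance on the actual datasets.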
We always use the 80-20 splitting of the scaled training-test data. In order to answer RQ5 and RQ6, we conduct the following experiments, where we use the 80-20 train-test splitting of D_1 and then augment the training set as mentioned above. Our experiments are conducted on a virtual machine on https://www.chameleoncloud.org/, running CentOS 7 on a machine with an x86_64 processor with 48 cores and a CPU frequency of 3.1 GHz. Table III summarizes the experimental results with a range of classifiers and the actual time spent on training a model and classifying the entire test set. We make several observations. First, for a specific classifier, using WHOIS features alone (Exp. 2) almost always leads to significantly higher effectiveness than using lexical features alone (Exp. 1), except for Logistic Regression. Second, for a fixed classifier, using both lexical and WHOIS features together (i.e., Exp. 3) always performs better than using lexical or WHOIS features alone. Third, among the classifiers considered, Random Forest performs the best in every metric in each experiment. In particular, Random Forest (i.e., a non-linear classifier) achieves a better performance than the Ensemble method because some classifiers (e.g., Logistic Regression and SVM) are substantially less accurate than the others and therefore "hurt" the voting results. Fourth, Decision Tree has the fastest execution time, followed by KNN and Random Forest, while Logistic Regression is the slowest and causes a delay for the voting ensemble. To understand the generalizability, when conducting Exp. 1 on the augmented D_2 with the benign:malicious ratio at 1.25:1, we observe that Random Forest outperforms the other models by achieving a 0.947 accuracy, a 0.066 FPR, a 0.041 FNR, and a 0.947 F1-score.
Insight 6: COVID-19 themed malicious website detectors must consider WHOIS features; and Random Forest performs the best among the classifiers that are considered.
The present study has several limitations, which should be addressed in future studies. First, we use a heuristic method to determine the ground truth. This heuristic method can only approximate the ground truth because the data sources (i.e., the CheckPhish and DomainTools feeds in this case) may contain some errors. Second, we could not avoid the data imbalance problem, meaning that the resulting detectors or classifiers may be slightly biased towards the majority class even after the oversampling. Third, we only considered the WHOIS and URL lexical features, but not the website contents or the network layer features. Fourth, we only considered five WHOIS features because most of the other kinds of WHOIS information are largely missing, which means that WHOIS registrars need to collect more detailed information than what is available at the time of writing. Fifth, the application of deep learning models or explainable ML is left to future research. Sixth, we observe that the python library wordninja can make bad splits at times (e.g., when a domain name is seemingly in English characters but actually in another language). We have presented the first systematic study on data-driven characterization and detection of COVID-19 themed malicious websites. We presented a methodology and applied it to a specific dataset. Our experiments led to several insights, highlighting that attackers are agile, crafty, and economically incentivized in waging COVID-19 themed malicious website attacks. Our experiments show that Random Forest can serve as an effective detector against these attacks, especially when WHOIS information about the websites in question is available. This highlights the importance of domain registrars collecting more information when registering domains in the future.
Amid covid-19, global orgs see a 148% spike in ransomware attacks; finance industry heavily targeted
Spotting and blacklisting malicious covid-19-themed sites
Using deep learning to detect malicious urls
Phishing url detection through top-level domain analysis: A descriptive approach
Detecting phishing websites through deep reinforcement learning
A stacking model using url and html features for phishing webpage detection
Machine learning based phishing detection from urls
Cross-layer detection of malicious websites
Malicious websites detection via cnn based screenshot recognition
Learning to detect malicious urls
Identifying generic features for malicious url detection system
What's in a url: Fast feature extraction and malicious url detection
Malicious url detection using convolutional neural network
Detecting malicious urls via a keyword-based convolutional gated-recurrent-unit neural network
An evasion and counter-evasion study in malicious websites detection
Metrics towards measuring cyber agility
Characterizing honeypot-captured cyber attacks: Statistical framework and case study
Predicting cyber attack rates with extreme values
Spatiotemporal patterns and predictability of cyberattacks
Modeling and predicting cyber hacking breaches
Cybersecurity dynamics: A foundation for the science of cybersecurity
Preventive and reactive cyber defense dynamics is globally stable
Quantifying the security effectiveness of firewalls and dmzs
A survey on systems security metrics
Quantifying the security effectiveness of network diversity
Covid-19 (coronavirus) phishing & scam tracker
Free covid-19 threat list - domain risk assessments for coronavirus threats
Cisco umbrella 1 million
Shannon entropy
Applied predictive modeling
Acknowledgement. We thank the reviewers for their useful comments. This work was supported in part by ARO Grant #W911NF-17-1-0566, ARL Grant #W911NF-17-2-0127, and the NSA OnRamp II program.