key: cord-0045864-aqm9gjf3
authors: Liu, Zhicheng; Li, Shuhao; Zhang, Yongzheng; Yun, Xiaochun; Peng, Chengwei
title: Ringer: Systematic Mining of Malicious Domains by Dynamic Graph Convolutional Network
date: 2020-05-22
journal: Computational Science - ICCS 2020
DOI: 10.1007/978-3-030-50420-5_28
sha: b07510f7bcd8eaf39ed872a5ceabe6f34292351c
doc_id: 45864
cord_uid: aqm9gjf3

Malicious domains are critical resources in network security, behind which attackers hide malware to launch malicious attacks. Blocking malicious domains is therefore the most effective and practical way to combat and reduce hostile activities. Previous methods for domain classification suffer from three limitations: (1) they rely solely on local domain features, which tend not to be robust; (2) they require a large amount of ground truth for model training to reach high accuracy; (3) they learn statically on the whole graph, which does not scale. In this paper, we present Ringer, a scalable method that detects malicious domains with a dynamic Graph Convolutional Network (GCN). Ringer first uses querying behaviors or domain-IP resolutions to construct domain graphs, on which the dynamic GCN learns node representations that integrate both node features and graph structure. These high-quality representations are then fed to a fully-connected neural network for domain classification. Notably, instead of learning statically on the global graph, we use a time-based hash to cut large graphs into small ones and inductively learn node embeddings from selectively sampled neighbors. We conduct a series of experiments on DNS traces from a large ISP over two days and compare Ringer with the state of the art. The results demonstrate that Ringer achieves excellent performance, with a high accuracy of 96.8% on average. Additionally, we find thousands of potential malicious domains by semi-supervised learning.

Malicious domains are important platforms for launching malicious attacks, such as spamming, phishing, and botnet command and control (C2) infrastructure. Financial losses are rising as domain-based fraud grows more prevalent; for instance, some ransomware is disseminated through domains and encrypts personal files to extort money from individuals, causing emotional and professional damage as well. Violations of personal privacy and intellectual property enabled by malicious domains are also serious issues. Therefore, effective detection of malicious domains is of the utmost importance in fighting malware.

Since blocking malicious domains immediately reduces malware activity, a wealth of techniques have been proposed to discover them. These efforts fall into two categories: feature-based and behavior-based methods. Traditional feature-based methods [9-11, 26] use network or domain-name features to distinguish malicious domains from benign ones, and are efficient and effective. However, attackers can circumvent such detection through simple manipulation of these features, for example by making a generated domain as pronounceable as a normal one. Advances in machine learning make it possible to achieve superior classification performance over passive DNS statistical features, but such classifiers are hindered by adversarial examples crafted with Generative Adversarial Networks (GANs) [8] and need a large amount of labeled data obtained through time-consuming manual inspection.
Behavior-based methods [17, 19, 21] use the querying relationships of hosts or the resolution relationships of domains to build association graphs in which domains or hosts are nodes. Inference algorithms (e.g., Belief Propagation [31]) are then leveraged to infer the reputation of unlabeled domains from the given labeled ones. However, these methods achieve poor precision when the amount of ground truth is small, and they can do nothing for isolated nodes that have no relationship with the ground truth. Moreover, some methods [20, 25] based on analyzing host behavior are prone to evasion techniques such as fake-querying and sub-grouping.

To overcome the limitations of existing methods, we propose Ringer, a novel and effective system for classifying malicious domains. Our insight is built on three observations: (1) Malicious domains often serve malicious users, whose behaviors deviate from those of benign domains usually registered for benign services; it is therefore very likely that multiple malicious domains are accessed by overlapping sets of clients. (2) Multiple malicious domains are commonly hosted on the same server hosts, because malicious IPs are important and scarce resources that are generally reused. For example, attackers typically place a number of rogue programs on the same servers they control to reduce cost, and once identified, the malware is migrated to another host in bulk. The reuse of these malicious resources and the common querying behavior toward them reflect a strong correlation among malicious domains. (3) Graph convolutional networks (GCN) [18] can aggregate both graph structure and static features to generate underlying node representations for classification tasks. We therefore use these two strong associations, combined with graph deep learning techniques, to discover malicious domains.

Ringer models the associations among domains as graphs, on which a dynamic GCN algorithm learns high-quality node representations that are amenable to efficient domain classification. Instead of analyzing each domain in isolation, we study the relevance of multiple domains involved in malicious activities. The robustness of the system is enhanced by these behavioral relationships and global correlations, neither of which can be eliminated or altered at an attacker's will. More specifically, detection proceeds in three main steps. First, Ringer constructs the domain association graph according to querying behavior and resolution relationships. Then, Ringer uses the dynamic GCN to quickly aggregate the attribute information associated with each vertex over the neighborhood defined by the graph structure, in a sampling fashion, thus transforming graph-represented domains into vectors. Finally, the node representations, which embed high-dimensional structural information as well as attributes, are fed to a neural network for malicious domain mining.

Two challenges need to be addressed: (1) the existing GCN operates on full static graphs and cannot be directly applied to large-scale graphs (especially our internet-scale domain data); (2) it is time-consuming to relearn the entire graph every time, as the graph is constantly evolving.
Therefore, we introduce two innovations: (1) hashing the large graphs into small ones according to time; (2) sampling the neighborhood of each node to construct the graph dynamically, so that training depends only on a node and its sampled neighbors rather than on the entire input graph. To improve performance, we additionally aggregate all intermediate representations with a layer-wise attention encoder. Our contributions are summarized as follows:

- We propose a robust malicious domain classification method named Ringer, which transforms the research problem from malicious domain mining to graph learning. We use time-based hashing to split large graphs into small ones, which makes Ringer more scalable.
- A dynamic GCN is developed to automatically capture the structure and attribute information of the sampled neighbors, generating the underlying representations used for domain classification.
- Our method has strong generalizability and is independent of the input graph: once the model is trained, it can be applied directly to new domain association graphs or newly-added nodes.
- We implement a prototype of Ringer and perform experiments on large-scale DNS traces from an ISP to evaluate its effectiveness. The results show that our system has better scalability and accuracy than state-of-the-art systems, and that it uncovers a large number of potential malicious domains.

The remaining sections of this paper are organized as follows. In Sect. 2, we review related work. In Sect. 3, we describe the associations among malicious domains and the original GCN. We give a systematic overview of Ringer and detail each module in Sect. 4. The collection of the datasets and ground truth used in this paper is described in Sect. 5. We present the experimental results and discuss the limitations of our system in Sect. 6, and conclude the study in Sect. 7.

Malicious domain detection has been a hot issue, and great strides have been made in discovering malicious domains over the past years. Detection on DNS can expose malicious activity early, because DNS queries naturally precede attack traffic; DNS-based detection is also lighter-weight than inspecting all traffic and does not need to contend with traffic encryption. Here we present some representative works, divided into two categories: feature-based and behavior-based methods.

Feature-based methods apply machine learning to domain statistical features to detect malicious domains. Antonakakis et al. [9] put forward Notos, a system that uses different characteristics of a domain to compute a reputation score indicating whether the domain is malicious or legitimate. Bilge et al. [11] propose EXPOSURE, which needs less training data and less time to detect malicious domains, and which can detect malicious domains of new categories without updating its data feeds. Rahbarinia et al. [25] present Segugio, a system that tracks the co-occurrence of malicious domains by constructing a machine-domain bipartite graph to detect new malware-related domains. Antonakakis et al. [10] use the similarity of Non-Existent Domain (NXDomain) responses generated by Domain Generation Algorithms (DGA) and of querying behavior to identify DGA-related domains. Lee et al. [19] build domain travel graphs that represent DNS query sequences to detect infected clients and malicious domains, an approach that is robust with respect to space and time.
Thomas et al. [28] analyze the NXDomain querying patterns observed at several TLD authoritative name servers to identify strongly connected groups of malware-related domains.

Behavior-based approaches detect malicious domains by analyzing host or network behavior. Manadhata et al. [21] first use proxy logs to construct host-domain bipartite graphs, on which a belief propagation algorithm evaluates the malicious probability of domains. Khalil et al. [17] construct a domain-resolution graph and find new malicious domains by exploiting the strong correlation between domain resolution values. Tamersoy et al. [27] introduce AESOP, a system that uses locality-sensitive hashing to measure the strength of the relationships between files, builds a graph from these relationships, and propagates information from tagged files to untagged ones. Liu et al. [20] propose CCGA, which first clusters NXDomains based on the fact that hosts infected by the same DGA family generate the same set of domains, and then extracts the cooperative behavioral relationships of the domains (including time, space and character similarity) to classify NXDomain clusters. Peng et al. [23] design MalShoot, which constructs a domain-resolution graph and learns the relationships among domains using graph embedding technology; the node embeddings are used as features to identify malicious domains and defend against illegal activities.

In this section, we first describe the malicious domain correlations used in malicious domain mining. Then we discuss the technique of interest, GCN. Our method uses the relationships among domains combined with statistical characteristics for domain classification. It is suited to detecting multi-domain malicious activities, complementing the detection of single-domain malicious activities. The associations among malicious domains fall mainly into two categories: client similarity and resolution similarity.

Client Similarity. Multiple malicious domains are accessed by a collection of overlapping clients; this is client similarity. The similarity arises from the infecting malware: hosts infected with the same malware may share the same list of queried domains or seeds, and they follow the same attack strategy of getting instructions from one domain and downloading or updating malware from another. Normal and malicious domains generally have no intersection on the client side. There are some exceptions; for example, some malware now uses a fake-querying technique that deliberately queries benign domains to interfere with detection. Fortunately, such benign domains have clearly recognizable features, such as being accessed by a very large number of hosts, so they can easily be removed.

Resolution Similarity. Multiple malicious domains are hosted on the same malicious IPs; this constitutes resolution similarity. Resolution sharing reveals an essential association among domains that is not changed by human will. It stems from the scarcity of malicious IP resources and the flexibility of multi-domain malware: malicious IPs, as pivotal resources available to attackers, are few in number, while much malware needs multiple domains to evade detection and to improve its usability, attack capability or survivability. For these reasons, multiple malicious domains end up hosted on the same set of IPs controlled or maintained by attackers. For example, once a domain is blacklisted, DGA-based malware will generate other domains that resolve to the same IP.
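To make client similarity concrete, the following toy sketch counts shared clients for every domain pair; this overlap count is exactly the edge-weight notion used later in graph construction. The record layout and domain names are our own illustrative assumptions, not the paper's data.

```python
from collections import defaultdict
from itertools import combinations

# Toy (domain, querying-client) observations; names are made up.
records = [
    ("evil-a.com", "10.0.0.1"), ("evil-a.com", "10.0.0.2"),
    ("evil-b.net", "10.0.0.1"), ("evil-b.net", "10.0.0.2"),
    ("news.example", "10.0.0.9"),
]

clients = defaultdict(set)          # domain -> set of querying clients
for domain, client in records:
    clients[domain].add(client)

shared = {}                         # (domain_i, domain_j) -> shared-client count
for d1, d2 in combinations(sorted(clients), 2):
    overlap = len(clients[d1] & clients[d2])
    if overlap > 0:
        shared[(d1, d2)] = overlap

print(shared)                       # {('evil-a.com', 'evil-b.net'): 2}
```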
The convolution in a Convolutional Neural Network (CNN) essentially uses a shared-parameter filter (kernel) to extract spatial features by computing a weighted sum over a central point and its adjacent points. GCN is a deep learning technique that applies CNN-style convolution to graphs. Given an attributed graph G = (V, E), where V is a set of nodes of size M and E is a set of edges, let X ∈ R^{M×N} be the feature matrix, with each node v ∈ V endowed with N-dimensional attributes, and let A ∈ R^{M×M} be the adjacency matrix, in which A_{i,j} = a_{i,j} if there is an edge e = <i, j> with weight a_{i,j}, and A_{i,j} = 0 otherwise. Kipf et al. [18] give the layer-wise propagation rule of a multi-layer GCN:

H^{(l+1)} = σ( D̃^{-1/2} Ã D̃^{-1/2} H^{(l)} W^{(l)} ),    (1)

where Ã = A + I_M is the adjacency matrix with added self-loops, D̃ is the diagonal degree matrix with D̃_{ii} = Σ_j Ã_{ij}, W^{(l)} is a layer-specific trainable weight matrix, σ(·) is an activation function, and H^{(l)} is the matrix of activations in the l-th layer, with H^{(0)} = X. When we consider only a 2-layer GCN, the forward model takes the form

Z = softmax( Â ReLU( Â X W^{(0)} ) W^{(1)} ),    (2)

with Â = D̃^{-1/2} Ã D̃^{-1/2}. Here, W^{(0)} ∈ R^{N×H} is an input-to-hidden weight matrix with H feature maps in the hidden layer, and W^{(1)} ∈ R^{H×F} is a hidden-to-output weight matrix, assuming the output has F categories. The softmax activation function maps each element of a vector x into [0, 1], with all elements summing to 1, and is defined as softmax(x_i) = exp(x_i) / Σ_j exp(x_j). We evaluate the cross-entropy loss over the labeled samples as

L = − Σ_{l ∈ y_L} Σ_{f=1}^{F} Y_{lf} ln Z_{lf},    (3)

where y_L is the set of node indices that have labels in semi-supervised multi-class classification. The weight matrices W^{(0)} and W^{(1)} are trained with gradient descent to minimize the loss function (3).

GCN learns structure and attribute information simultaneously and has achieved great success in clustering and classification tasks [30, 32]. However, as formula (1) shows, the output of each node in the current layer requires the attributes of its neighbors from the previous layer, and as the number of layers increases, the number of supporting nodes required per node grows exponentially. Plain GCN is therefore only suitable for relatively small graphs, not for large-scale domain data such as ours. Some works [12, 15] offer a perspective on the effectiveness of sampling neighbors.
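As a concrete illustration of formulas (1)-(3), the following minimal NumPy sketch runs the 2-layer forward model of formula (2) on random data. The shapes and initialization are assumptions for illustration only; this is not the paper's (dynamic) implementation.

```python
import numpy as np

M, N, H, F = 5, 8, 16, 2            # nodes, input feats, hidden feats, classes

A = (np.random.rand(M, M) > 0.6).astype(float)
A = np.triu(A, 1); A = A + A.T      # symmetric adjacency, no self-loops yet
X = np.random.rand(M, N)            # node feature matrix

A_tilde = A + np.eye(M)             # add self-loops: Ã = A + I
d = A_tilde.sum(axis=1)
D_inv_sqrt = np.diag(d ** -0.5)
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # Â = D̃^{-1/2} Ã D̃^{-1/2}

W0 = np.random.randn(N, H) * 0.1    # input-to-hidden weights
W1 = np.random.randn(H, F) * 0.1    # hidden-to-output weights

hidden = np.maximum(A_hat @ X @ W0, 0)      # ReLU(Â X W0)
logits = A_hat @ hidden @ W1                # Â hidden W1
Z = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # row softmax
print(Z.shape)                      # (5, 2): per-node class probabilities
```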
The goal of Ringer is to discover domains involved in malicious activities by analyzing passive DNS traffic (traces). As shown in Fig. 1, the architecture of Ringer consists of three modules: preprocessing, graph construction and dynamic GCN. To better describe our design, we use the notation listed in Table 1. The preprocessing module consists of two parts: time-based hashing and noise filtering.

Time-Based Hash. The goal of time-based hashing is to divide big graphs into small ones at the granularity of time. Domain-based malicious behaviors are typically continuous and of short duration, such as the bursts of domains generated by a DGA. Intuitively, if the shared similarity between two domains occurs within a short period of time, their relevance is strong; over a long period, shared similarity has a high probability of arising for other reasons (such as server migration), which introduces weak correlations. For example, the relevance of two domains that share hosts within one hour differs from that of two domains that share hosts in different weeks. The length of the graph-construction window therefore affects both TPs and FPs: if the window is too short, associations between domains do not show up; if it is too long, a large number of weak correlations are introduced. Due to the page constraint, we will discuss the impact of the window length on performance in future studies. In this paper, we follow [10] and empirically set the timespan to one hour. The time-based hash step is shown in Fig. 1. First, all records are hashed into different time buckets according to their timestamps. Then, noise filtering removes noise from each bucket. Finally, we construct a graph in each hash bucket for the subsequent operations.

Noise Filtering. Noise filtering is designed to remove the noise introduced by normal user activity and to reduce FPs. We adopt two filtering strategies. First, we remove domains based on client degree, defined as the number of clients that query the domain in a given epoch. The more clients a domain is queried by, the more popular it is and the less likely it is to be malicious; we remove domains queried by more than N clients (the selection of the threshold N is discussed later). Analogously, we define the resolution degree and remove domains with more than M resolution values. In this way, we remove public domains, such as those of content delivery networks (CDN), cloud platforms and other public services. We further remove domains generated by normal applications such as BitTorrent, since they have obvious recognizable characteristics, for example names ending with 'tracker' or 'DHCP'.

Ringer uses a graph representation in which domains are nodes and the relationship between two domains is an edge. If two domains have shared clients (resolutions) within a given period of time (one hash bucket), an edge exists between the corresponding nodes, weighted by the number of shared clients (resolutions). The graph construction procedure is summarized as Algorithm 1. To save space, we store the attributes of nodes and edges in files and retrieve them from the file system whenever needed. The graph construction process thus outputs the domain correlation graph and the graph-related property files. The construction aggregates DNS query information on the graph in a cumulative manner, which makes the whole system more robust and scalable. A minimal sketch of bucketing and graph construction follows.
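The sketch below illustrates the preprocessing and graph-construction steps described above (time-based hashing, client-degree filtering, and shared-client edge weights). Algorithm 1 itself is not reproduced in the text, so the record layout, function name and use of networkx are assumptions.

```python
import networkx as nx
from collections import defaultdict
from itertools import combinations

BUCKET = 3600                        # one-hour buckets, as in the paper
N_CLIENTS = 600                      # client-degree threshold N

def build_graphs(records):
    """records: iterable of (timestamp, client_ip, domain) tuples."""
    buckets = defaultdict(lambda: defaultdict(set))
    for ts, client, domain in records:
        buckets[ts // BUCKET][domain].add(client)   # time-based hash

    graphs = {}
    for bucket_id, dom2clients in buckets.items():
        # noise filtering: drop overly popular (likely benign) domains
        dom2clients = {d: c for d, c in dom2clients.items()
                       if len(c) <= N_CLIENTS}
        g = nx.Graph()
        g.add_nodes_from(dom2clients)
        # a production version would index domains by client to avoid
        # this quadratic pair scan; kept simple here for clarity
        for d1, d2 in combinations(dom2clients, 2):
            w = len(dom2clients[d1] & dom2clients[d2])
            if w > 0:                # edge weight = number of shared clients
                g.add_edge(d1, d2, weight=w)
        graphs[bucket_id] = g
    return graphs

demo = [(0, "10.0.0.1", "evil-a.com"), (10, "10.0.0.1", "evil-b.net")]
g = build_graphs(demo)[0]
print(g.edges(data=True))   # [('evil-a.com', 'evil-b.net', {'weight': 1})]
```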
Dynamic GCN is the key part of our method; it takes the graph structure and the attributes associated with nodes as input and generates a high-quality embedding for each node. These embeddings are then fed to a fully-connected neural network for domain classification. In this section, we discuss the technical details of dynamic GCN.

Dynamic GCN uses localized graph convolution to generate embeddings from selectively sampled neighbors; the forward propagation procedure is shown as Algorithm 2. More specifically, to generate the embedding of node v, we only need its input features and its graph neighbors N_v = {u | (u, v) ∈ E}. We first selectively sample neighbors according to the weights of the edges. Then, a node-wise attention aggregator aggregates the sampled neighbors in an attention fashion. Next, we concatenate the aggregation from the neighbors with the current representation of the node to form an intermediate representation, which is encoded into a new node embedding. The output of the algorithm is a node representation that incorporates both the node's own information and that of its neighbors; we apply L2 normalization to it to prevent over-fitting and to make training more stable. Finally, a layer-wise attention module encodes the embeddings generated by the different layers to enhance representation learning.

Instead of random sampling, we sample neighbors according to edge weight, on the premise that the larger the weight of an edge, the stronger the correlation between its endpoints. Consequently, for each node we sample as neighbors the top m nodes with the largest weights. Fixing the number of sampled nodes matters for the later computational model, and selecting a fixed set of nodes to aggregate lets us control the memory footprint during training.

With this neighbor context, we can use GCN to aggregate neighbor features. However, neighbors exert different influence on the target node, owing to differing relationship strengths between nodes and to noise. We therefore propose a node-wise attention aggregator, which aggregates neighbor nodes in a weighted fashion. As shown in Fig. 2 (left), we use the normalized edge weights as the attention weights, in accordance with the relationship intensities. That is, for neighbors N_v = {u_1, u_2, ..., u_m} with corresponding weights {w_1, w_2, ..., w_m}, the attention weights are w_1/L, w_2/L, ..., w_m/L, where L = |w_1| + |w_2| + ... + |w_m| is the L1 norm of the weights. Compared with the original GCN, our node-wise attention aggregator is

z_{N_v}^{(k)} = Σ_{i=1}^{m} (w_i / L) z_{u_i}^{(k-1)},    z_v^{(k)} = σ( W^{(k)} · concat( z_v^{(k-1)}, z_{N_v}^{(k)} ) + b^{(k)} ),    (4)

where k denotes the layer index, W^{(k)} and b^{(k)} are trainable weight and bias parameters, and z_v^{(0)} is initialized with x_v (a code sketch of this aggregator appears at the end of this subsection).

Training Details. We train Ringer in a semi-supervised way, with the loss function of formula (3). Our goal is to optimize the parameters so that the labels output by the model are close to the ground truth on the labeled dataset. We use mini-batch gradient descent at each iteration, which keeps training within memory.
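Below is a hedged sketch of one layer of the node-wise attention aggregator of formula (4). The dimensions, the random data, and the choice of ReLU for σ are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def aggregate_layer(z_v, z_neighbors, weights, W, b):
    """z_v: (d,) self embedding; z_neighbors: (m, d) sampled neighbors;
    weights: (m,) edge weights used as attention scores."""
    attn = np.abs(weights) / np.abs(weights).sum()      # L1-normalized attention
    z_nv = attn @ z_neighbors                           # weighted neighbor sum
    h = np.concatenate([z_v, z_nv])                     # concat self and neighbors
    z_new = np.maximum(W @ h + b, 0)                    # encode with ReLU
    return z_new / (np.linalg.norm(z_new) + 1e-12)      # L2 normalization

d, m, out = 8, 3, 16
rng = np.random.default_rng(0)
z_v = rng.normal(size=d)
z_nb = rng.normal(size=(m, d))          # top-m neighbors by edge weight
w = np.array([5.0, 2.0, 1.0])           # edge weights of the sampled neighbors
W = rng.normal(scale=0.1, size=(out, 2 * d))
b = np.zeros(out)
print(aggregate_layer(z_v, z_nb, w, W, b).shape)        # (16,)
```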
ISP Dataset. To verify the effectiveness of our scheme, we collected DNS traces at an ISP recursive server over two days, from November 4th to November 5th, 2017. Our data provider has taken many steps to eliminate privacy risks for network users. Owing to the limitations of storage space and computing resources, we uniformly sample the data at a rate of 1/10; previous work [22] has shown that uniform sampling makes it possible to evaluate and extrapolate the key attributes of the original dataset from the samples. This billion-scale dataset contains 1.1 billion records per hour, covering various DNS record types. Our experiments are based only on A records, which guarantees the existence of domain-to-IP resolutions (Non-Existent Domains, NXDomains, aside). After the time-based hash operation, the data is cut into 48 portions along the time series. To illustrate the data, we give an overview (Fig. 3) of one bucket, from 1:00 to 2:00 on November 4th, 2017, retaining only domains with A records. The distributions of querying hosts and of resolutions are heavy-tailed: a few domains are accessed by a large number of hosts or resolved to a large number of IPs.

Ground Truth. We collected 21,558 distinct second-level domains from a number of authoritative blacklists, including malwaredomains.com [2], ZeuS Tracker [7], malwaredomainlist.com [5], malc0de.com [4] and Bambenek Consulting [1]. We also use all DGA data from DGArchive [24].

Features for Learning. In the experiments, we extracted 42 statistical features to represent domain names, with reference to FANCI [26]; they are listed in Table 2 and fall into three categories: structural, linguistic and statistical features. A few such features are illustrated in the code sketch at the end of this part.

In this section, we discuss the selection of parameters on real-world data and evaluate the effectiveness of our approach, as well as its tracking ability. To select appropriate thresholds N (client degree) and M (resolution degree), we count the client and resolution degrees of each domain in the ground truth. Figure 4 shows the Cumulative Distribution Function (CDF) of the number of domains over client degree and resolution degree in one bucket. We find that more than 99% of malicious domains have a client degree below 600. We manually checked the domains with relatively large client degrees and found that they are FPs, such as CDN or dynamic DNS domains. For example, the domain "ec2-52-207-234-89.compute-1.amazonaws.com" was reported for Cerber ransomware by malwaredomainlist.com, but it would be unreasonable to treat the second-level domain "amazonaws.com", which appears in the Alexa top list, as malicious. To cover as many malicious domains as possible while reducing FPs, we empirically choose the threshold N = 600, which retains 99% of the malicious domains. Similarly, we set the resolution-degree threshold to M = 100. We manually checked the removed domains, such as "herokuapp.com", "igexin.com" and "cloudapp.net"; all of them provide public services and are unlikely to be malicious, so it makes sense to remove them.

We use three metrics to measure the performance of our approach: True Positive Rate (TPR), False Positive Rate (FPR) and accuracy (ACC), defined as

TPR = |TPs| / (|TPs| + |FNs|),  FPR = |FPs| / (|TNs| + |FPs|),  ACC = (|TPs| + |TNs|) / (|TPs| + |TNs| + |FPs| + |FNs|).

Each bucket contains around 3,000 malicious domains and around 200,000 benign domains from the ground truth. In total, 136,827 malicious domain instances are labeled over the two days (domains may repeat across buckets, but with different neighbors). We randomly select an equal number (136,827) of benign domains, feed them to Ringer, and adopt K-fold cross validation (K = 5 in this paper). The results are shown in Table 3: our system achieves a high TPR (0.957 on average) and a low FPR (0.020 on average) in domain classification.
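The sketch below computes a handful of FANCI-style lexical features of the kind listed in Table 2. The exact 42-feature set is not reproduced in this text, so these four features and their names are our own illustrative choices.

```python
import math
from collections import Counter

def domain_features(domain):
    name = domain.rstrip(".")
    labels = name.split(".")
    sld = labels[-2] if len(labels) >= 2 else labels[0]  # second-level label
    counts = Counter(sld)
    entropy = -sum((c / len(sld)) * math.log2(c / len(sld))
                   for c in counts.values())             # character entropy
    return {
        "length": len(name),                             # structural
        "num_labels": len(labels),                       # structural
        "digit_ratio": sum(ch.isdigit() for ch in sld) / len(sld),  # linguistic
        "entropy": entropy,                              # statistical
    }

print(domain_features("xjpakmdcfuqe.com"))
```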
We compare Ringer with the following state-of-the-art baselines: FANCI [26] and node2vec [14] combined with features. FANCI is an open-source system that extracts 21 domain features and uses machine learning (such as a Support Vector Machine, SVM) to classify domains as benign or malicious. Node2vec, also open source and publicly available, maps nodes into low-dimensional feature vectors by unsupervised learning, preserving the network neighborhood relations of nodes; since it encodes only network structure, we first concatenate the node embeddings with the extracted features and then use an SVM to classify the domains. We applied both methods to our dataset, with the results shown in Table 3. Both systems achieve promising results. However, FANCI ignores structural information, and these relations, embodied in time and space, are of great significance for classifying malicious domains. Node2vec considers only the structural relationships among domains; concatenating the learned embeddings with static features improves classification performance somewhat, yet node2vec can only learn embeddings for connected nodes by unsupervised learning and can do nothing for isolated nodes. Moreover, to obtain embeddings for newly-added nodes, all nodes must be relearned globally, which is time-consuming and labor-intensive. Our system clearly outperforms the other two: Ringer simultaneously learns statistical and structural features and combines the intermediate representations to make full use of all multi-order information about domains.

Domains detected by Ringer can be viewed as subgraphs; for example, Ringer detects 152 malicious domain subgraphs in bucket 2. To present the results more intuitively, we select the top 8 subgraphs by connected-component size; the renderings, drawn with the Gephi software [3], are shown in Fig. 5. The images vividly illustrate how attackers organize and deploy malicious domains, which is conducive to the future study of malicious resources.

One significant function of Ringer is detecting potentially malicious domains. As is well known, the redundancy of multiple domain names gives malware flexibility, so the detection of newly generated (zero-day) and rarely used malicious domains is an important measure of a system. Applying the model directly to domains outside the ground truth, we find 6,109 potentially suspicious malicious domains. We verify their maliciousness conservatively with two heuristic methods. First, we use two excellent open-source and publicly available systems, FANCI [26] and LSTM [29]; FANCI applies machine learning to domain features, while LSTM uses deep learning to distinguish benign from malicious domains. Their classifications of the new suspicious domains are shown in Table 4; after both systems, 530 unique domains remain unconfirmed. Second, we look for historical snapshots of these domains, or of the IPs they resolve to, through VirusTotal (www.virustotal.com). Although many public blacklists exist, they are not comprehensive, and some contain malicious domains only for a given day, whereas VirusTotal aggregates 66 authoritative blacklists. Using the public API [6], we deem a domain malicious if it appears on at least one blacklist. As a result, 108 of the newly found malicious domains are exposed in the form of domains or IPs, and 422 domains remain without a qualitative judgment.
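A hedged sketch of the VirusTotal check described above: the paper cites the public API [6] without detail, so the v3 endpoint, header and response fields used here are assumptions based on VirusTotal's current documentation, and the API key is a placeholder.

```python
import requests

API_KEY = "YOUR_API_KEY"             # hypothetical placeholder

def is_blacklisted(domain):
    # Assumed VirusTotal v3 domain-report endpoint and auth header.
    resp = requests.get(
        f"https://www.virustotal.com/api/v3/domains/{domain}",
        headers={"x-apikey": API_KEY},
        timeout=10,
    )
    resp.raise_for_status()
    stats = resp.json()["data"]["attributes"]["last_analysis_stats"]
    return stats.get("malicious", 0) >= 1    # flagged by at least one engine

print(is_blacklisted("example.com"))
```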
Through this analysis, a total of 5,687 malicious domain names are confirmed (5,687 of the 6,109 candidates), so we have roughly 93% confidence that our system is able to respond to new threats.

Ringer is scalable to large-scale DNS data, such as DNS traces from an ISP. To further illustrate the scalability of our system, we analyze the complexity of Ringer. Suppose we have N records, and let V be the number of unique domains with return values on A records. During graph construction, all DNS records must be scanned, which takes O(N). The graph convolution is our main computational workhorse. Spectral convolution on a graph was previously defined as multiplication of a given per-node scalar signal x by a filter g_θ(Λ) = diag(θ) in the Fourier domain:

g_θ ⋆ x = U g_θ(Λ) U^T x,    (6)

where θ ∈ R^V and U is the matrix of eigenvectors of the normalized graph Laplacian L = I_V − D^{-1/2} A D^{-1/2} = U Λ U^T, with Λ the diagonal matrix of its eigenvalues. The computational complexity of formula (6) is O(V^2), and for large graphs the eigendecomposition of L is prohibitively expensive. To alleviate the computational cost, we adopt K-order truncated Chebyshev polynomials, as in [13], to approximate g_θ(Λ):

g_θ(Λ) ≈ Σ_{k=0}^{K−1} θ_k T_k(Λ̃),

where Λ̃ = (2/λ_max) Λ − I_V and λ_max is the largest eigenvalue of L. The Chebyshev polynomials are recursively defined by T_k(x) = 2x T_{k−1}(x) − T_{k−2}(x), with T_0 = 1 and T_1 = x. We therefore have the approximation

g_θ ⋆ x ≈ Σ_{k=0}^{K−1} θ_k T_k(L̃) x,    (7)

where the θ_k are Chebyshev coefficients and L̃ = (2/λ_max) L − I_V is the scaled Laplacian. The complexity of formula (7) is O(E), linear in the number of edges. The overall complexity is thus driven mainly by the number of unique edges, and E ≤ m·V, where m is the (constant) number of sampled neighbors. Figure 6 (left) shows how the number of distinct domains to be analyzed grows with the number of DNS queries in real-world data: the number of unique domains does not increase linearly with the number of queries, but follows Heaps' law, V(N) = αN^β. To find the parameters α and β, we take log2 of both sides, obtaining log2(V) = log2(α) + β·log2(N). Figure 6 (right) shows the scatter of the points (log2(N), log2(V)) (blue) and the straight line (red) fitted with the least-squares approximation; on our dataset, α = 2.577 and β = 0.6536. To sum up, the computational overhead of the whole system is O(N), linear in the number of input records, which shows that Ringer scales.
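The Heaps'-law fit above can be reproduced with a few lines of NumPy. The query counts below are synthetic stand-ins (the real curve comes from Fig. 6), so only the fitting procedure, not the data, reflects the paper.

```python
import numpy as np

N = np.array([1e6, 5e6, 1e7, 5e7, 1e8])     # numbers of DNS queries (synthetic)
V = 2.577 * N ** 0.6536                      # unique domains (synthetic here)

# least-squares line in log2 space: log2(V) = log2(alpha) + beta * log2(N)
beta, log2_alpha = np.polyfit(np.log2(N), np.log2(V), 1)
print(f"alpha = {2 ** log2_alpha:.3f}, beta = {beta:.4f}")  # recovers 2.577, 0.6536
```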
We discuss two drawbacks that need to be considered. One potential limitation is that Ringer cannot distinguish the specific service categories that malicious domains are used for, such as phishing, spamming or C2; we lack a sufficiently detailed knowledge base or ground truth to cover and label them and thus obtain the distribution of malicious categories. Another limitation is that for unassociated or weakly associated domains, our approach is equivalent to using only static statistical features, without neighbor relevance.

The intelligence with which attackers use malicious domains makes them more resilient to existing detection methods. In this paper, we propose Ringer, a malicious domain detection mechanism that uses graphs to represent the strong correlations among domains, namely client similarity and resolution similarity. A dynamic GCN is used to inductively learn node representations that combine structural information with statistical features. Learning that depends only on each node and its sampled neighbors yields large gains in both effectiveness and efficiency, and exposing the relevance of malicious domains through graphs makes the system robust against several evasion techniques. We evaluate the system on DNS data from an ISP, and the results show that it achieves high precision and scalability. As a promising effort, our approach helps to prevent illegal domain-based activities more effectively in practice.

References.
- Gephi: The Open Graph Viz Platform
- DeepDGA: adversarially-tuned domain generation and detection
- Building a dynamic reputation system for DNS
- From throw-away traffic to bots: detecting the rise of DGA-based malware
- EXPOSURE: finding malicious domains using passive DNS analysis
- FastGCN: fast learning with graph convolutional networks via importance sampling
- Convolutional networks on graphs for learning molecular fingerprints
- node2vec: scalable feature learning for networks
- Inductive representation learning on large graphs
- Densely connected convolutional networks
- Discovering malicious domains through passive DNS data graph analysis
- Semi-supervised classification with graph convolutional networks
- GMAD: graph-based malware activity detection by DNS traffic analysis
- CCGA: clustering and capturing group activities for DGA-based botnets detection
- Detecting malicious domains via graph inference
- Spatiotemporal mining of software adoption & penetration
- MalShoot: shooting malicious domains through graph embedding on passive DNS data
- A comprehensive measurement study of domain generating malware
- Segugio: efficient behavior-based tracking of malware-control domains in large ISP networks
- FANCI: feature-based automated NXDomain classification and intelligence
- Guilt by association: large scale malware detection by mining file-relation graphs
- Kindred domains: detecting and clustering botnet domains using DNS traffic
- Predicting domain generation algorithms with long short-term memory networks
- A comprehensive survey on graph neural networks
- Understanding belief propagation and its generalizations
- Graph convolutional neural networks for web-scale recommender systems

Acknowledgments. This work is supported by the National Key Research and Development Program of China (Grant No. 2016YFB0801502) and the National Natural Science Foundation of China (Grant No. U1736218). We are grateful to the anonymous reviewers for their insightful comments, and many thanks to Daniel Plohmann for providing us access to the DGArchive. I also owe my sincere gratitude to Ms. Zhou Yue, who helped me a lot during the writing of this paper. The corresponding author of this paper is Shuhao Li.