key: cord-1017200-gr9h06fa
authors: Patil, Nilesh Vishwasrao; Krishna, C. Rama; Kumar, Krishan
title: SSK-DDoS: distributed stream processing framework based classification system for DDoS attacks
date: 2022-01-17
journal: Cluster Comput
DOI: 10.1007/s10586-022-03538-x
sha: c46b636b3d7446877b2fe19e7d39d89571d29fae
doc_id: 1017200
cord_uid: gr9h06fa

Distributed denial of service (DDoS) is an immense threat to Internet-based applications and their resources. It floods the victim system with a large number of network packets, so that the victim's resources become unavailable to legitimate users. It is therefore regarded as a dangerous attack on Internet-based applications and their resources. Several security approaches have been proposed in the literature to protect Internet-based applications from this type of threat. However, the frequency and strength of DDoS attacks are increasing day by day. Further, most of the DDoS attack detection systems based on traditional and distributed processing frameworks analyze network flows in offline batch mode. Hence, they fail to classify network flows in real-time. This paper proposes a novel Spark Streaming and Kafka-based distributed classification system, named SSK-DDoS, for classifying different types of DDoS attacks and legitimate network flows. The classification approach is implemented using distributed Spark MLlib machine learning algorithms on a Hadoop cluster and deployed on the Spark Streaming platform to classify streams in real-time. The incoming streams are consumed from a Kafka topic and preprocessed (features are extracted and formulated) so that they can be classified into seven groups: Benign, DDoS-DNS, DDoS-LDAP, DDoS-MSSQL, DDoS-NetBIOS, DDoS-UDP, and DDoS-SYN.
Further, the SSK-DDoS classification system stores the formulated features with their predicted class in HDFS, which helps to retrain the distributed classification approach using a new set of samples. The proposed SSK-DDoS classification system has been validated using the recent CICDDoS2019 dataset. The results show that the proposed SSK-DDoS efficiently classifies network flows into seven classes and stores the formulated features with the predicted class of each incoming network flow in HDFS.

Over the past decade, companies have been running their services online to grow revenue and to be reachable by users from anywhere at any time. Further, in recent times, there has been huge growth in Internet subscribers and connected devices. However, this significant growth has brought unsafe network routes and insecure connected devices. Attackers use this opportunity to compromise numerous nodes and form a botnet for performing DDoS attacks on the victim system. A DDoS attack is the biggest threat to Internet-based applications and their resources [1, 2]. The motive of this attack is to overwhelm Internet-based services by transmitting a large amount of attack traffic [3, 4]. A typical example of a DDoS attack on a victim system is presented in Fig. 1. In this example, a master takes control of various slaves with the help of handler programs. The handler is the intermediary program between the master and the slaves.

In this big data world, the traditional framework-based DDoS attack detection approaches themselves become victims while examining a massive number of packets. Therefore, there is a need to deploy the proposed approach on a distributed stream processing framework (DSPF). A DSPF has the capability to handle (store and analyze) a large volume of data in real-time by employing multiple nodes. Further, data transfer between nodes, the secure communication protocol, and metadata information are systematically managed by the DSPF.
The DDoS attack detection systems based on traditional and distributed processing frameworks (DPF) are specially designed to examine flows in offline mode. Therefore, this type of approach fails to analyze incoming streams in real-time. Additionally, most of the approaches have been tested on outdated datasets. Therefore, there is a need to design a distributed classification model using a recent dataset and deploy it on a DSPF (such as the Spark Streaming platform).

In this section, we summarize the open-source technologies that are required to design the proposed SSK-DDoS classification system for DDoS attacks. We split this section into four sub-sections: Apache Hadoop, Spark Streaming, Apache Kafka, and CICFlowMeter. A good DSPF must have the following features:
1. It analyzes streaming data, such as network traffic flows, as it is received and takes immediate action based on the prediction.
2. It supports real-time applications with a loosely-coupled architecture, so that multiple publishers and consumers can independently access the application without delay.
3. It analyzes data in a distributed manner and offers extremely low latency, reliability, scalability, fault tolerance, etc.

Fig. 2 Comparison of Q3-2020 and Q4-2020 country-wise statistics distribution of DDoS attacks [5, 6]

Apache Hadoop [7, 8] is one of the most powerful DPFs for storing and analyzing a large amount of data. It is specially designed to analyze a large amount of data using batch processing on a cluster of nodes. It consists of three major modules:

The authors of [10-13] have systematically discussed machine/deep learning methods and feature selection. However, a model designed using these techniques faces a scalability issue when deployed on a DPF/DSPF. The Spark MLlib machine learning library provides a way to design a distributed and in-memory machine learning model.
This type of model is specially designed to be deployed on a DPF/DSPF (Hadoop, Kafka, Spark, etc.). Therefore, it is attractive to implement a distributed classification approach for DDoS attacks using MLlib and deploy it on the Spark Streaming platform.

Apache Kafka [14] is an open-source, distributed, high-throughput publish-subscribe messaging system. It consists of six essential components: Brokers, Zookeeper, Topics, Partitions, Publishers, and Subscribers. The publish/consume feature of Kafka helps to provide a loosely-coupled architecture for real-time applications.

CICFlowMeter [15] is an open-source network flow generator tool. It creates network flows in offline mode (from PCAP files) and online mode (from network interfaces). It derives 83 attributes from the network traffic and stores them in a CSV file. An example of CICFlowMeter collecting network packets through the network interface card and generating network flows from them is presented in Fig. 3.

The rest of the paper is organized as follows. A summary of related works is presented in Sect. 2. Section 3 presents a novel distributed SSK-DDoS classification system for DDoS attacks. Section 4 provides testbed information for the classification approach. Results and analysis are presented in Sect. 5. Finally, Sect. 6 concludes the paper.

Numerous security approaches are available in the literature to protect victim systems from different DDoS attacks. Patil et al. [16] have systematically classified DDoS attack detection approaches into two broad classes based on their deployment frameworks: traditional and DPF-based detection approaches. In the literature [17-30], several authors have systematically summarized traditional framework-based approaches, and a few recent systems are described in [31-33]. However, few authors [16] have specifically addressed DPF-based approaches.
The DPF (batch processing) and DSPF (real-time) themselves have distributed designs to store and analyze a massive volume of data on a cluster of nodes. In the literature, some authors have proposed DPF- and DSPF-based approaches. However, most of them are deployed on a DPF. This type of detection approach efficiently analyzes a large number of packets and classifies them in a short time, but it is not capable of classifying network flows in real-time. It is useful for historical data analysis and for retraining the distributed model. Therefore, if the use case demands classification of network flows in real-time, one needs to deploy the proposed approach on a DSPF (such as the Spark Streaming platform).

We have drawn some inferences from the existing works related to DPF/DSPF. They are listed as follows:
- Most of the systems are designed and tested in offline mode. Therefore, there is a need to deploy a classification model for DDoS attacks on a DSPF, such as Apache Spark Streaming, that analyzes network traffic in real-time.
- A few researchers designed their classification models using shallow and deep learning algorithms. These models performed exceptionally well when deployed on traditional frameworks. However, they face a scalability issue when deployed on a DPF/DSPF. Therefore, there is a need to implement a distributed model using a distributed machine learning library that provides high scalability even when the model is deployed on a DPF/DSPF.
- Most of the DPF/DSPF-based DDoS approaches efficiently analyzed a huge number of network flows on a group of nodes by distributing the analysis task across multiple systems.
- Most of the existing DPF/DSPF-based DDoS mechanisms employed a counter-based detection methodology for identifying high-volume attacks. Therefore, this type of system fails to recognize low-volume DDoS attacks.
- Most of the DPF/DSPF and traditional framework-based DDoS mechanisms are validated using outdated datasets. A few authors [55] designed their system using a recent dataset. Therefore, there is a need for a new classification approach that can be validated using recent datasets, such as CICDDoS2019.

This section presents the functioning of the proposed SSK-DDoS classification system for DDoS attacks. The logical architecture of SSK-DDoS is given in Fig. 4. The distributed SSK-DDoS classification system consists of three Spark Streaming clusters: 'SC-1', 'SC-2', and 'SC-3'. The two Spark clusters 'SC-1' and 'SC-2' are deployed in the intermediate network, i.e., at ISP-1 and ISP-2, respectively. The primary job of the 'SC-1' and 'SC-2' clusters is to preprocess the incoming network traffic and pass it on to 'SC-3'. The 'SC-3' cluster is deployed in the victim network, and its job is to classify flows into seven classes.

In the first step, producer agents (from ISP-1 and ISP-2) continuously publish network flows generated by CICFlowMeter onto the "ssk_ddos_flow" topic. Both the 'SC-1' and 'SC-2' clusters immediately consume flows from the "ssk_ddos_flow" topic. The second step is to extract essential variables from the flows, formulate features using the extracted variables, and publish them on the "ssk_ddos_features" topic. Then the 'SC-3' cluster immediately consumes the formulated features of each flow from "ssk_ddos_features", classifies them into seven classes, and publishes the predicted class on the "ssk_ddos_prediction" topic so that action can be taken. Further, the system stores the formulated features of each flow with the predicted class in HDFS, which helps to retrain the distributed classification model of DDoS attacks using a new set of samples.
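The three-stage topic flow described above can be sketched with a small in-memory simulation. This is only an illustration under stated assumptions: the dictionary of queues stands in for Kafka topics, the list `stored` stands in for HDFS, the `ssk_` prefix is assumed for all three topic names, and the helper names (`publish`, `consume`, `preprocess`, `classify`, `run_pipeline`), the flow fields, and the threshold rule are hypothetical placeholders rather than the authors' implementation.

```python
from collections import defaultdict, deque

# In-memory stand-in for Kafka topics (assumption: illustration only, no real broker).
topics = defaultdict(deque)


def publish(topic, message):
    """Append a message to the named topic."""
    topics[topic].append(message)


def consume(topic):
    """Drain and yield every message currently queued on the named topic."""
    queue = topics[topic]
    while queue:
        yield queue.popleft()


def preprocess(flow):
    # Hypothetical feature formulation performed by 'SC-1' / 'SC-2':
    # ratio of forward to backward packets (not one of the paper's ten features).
    ratio = flow["fwd_packets"] / (flow["bwd_packets"] + 1)
    return {"flow_id": flow["flow_id"], "fwd_bwd_ratio": ratio}


def classify(features):
    # Placeholder rule standing in for the trained model deployed on 'SC-3'.
    return "DDoS-UDP" if features["fwd_bwd_ratio"] > 10 else "Benign"


def run_pipeline(flows):
    """Push flows through the flow -> features -> prediction topics."""
    stored = []  # stand-in for the HDFS store used for retraining
    for flow in flows:  # step 1: producer agents publish flows
        publish("ssk_ddos_flow", flow)
    for flow in consume("ssk_ddos_flow"):  # step 2: preprocessing clusters
        publish("ssk_ddos_features", preprocess(flow))
    for features in consume("ssk_ddos_features"):  # step 3: classification cluster
        label = classify(features)
        publish("ssk_ddos_prediction", {"flow_id": features["flow_id"], "label": label})
        stored.append({**features, "label": label})
    return list(consume("ssk_ddos_prediction")), stored
```

Because each stage only reads from one topic and writes to the next, the stages stay independent of each other, which is the loosely-coupled property the architecture relies on.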
Highlights of the proposed distributed SSK-DDoS classification system of DDoS attacks are as follows:
- Loosely-coupled architecture, as it uses a distributed publish-subscribe messaging system for communication
- Analysis of network traffic flows in real-time using the Spark Streaming API
- Computational overhead distributed between three clusters

The detection approach of the proposed SSK-DDoS classification system is split into two parts: the preprocessing task and the classification task.

The role of the 'SC-1' and 'SC-2' clusters is to consume network traffic, generate network flows using CICFlowMeter, select significant variables, scale the selected variables, formulate features using the scaled variables, and finally publish them on the "ssk_ddos_features" topic. Both 'SC-1' and 'SC-2' have a separate Kafka topic with the same name, "ssk_ddos_features". We split this section into three subsections: creating network flows, scaling variables, and formulating features.

CICFlowMeter generates network flows with 83 attributes from incoming traffic and writes the flows to a CSV file. We employ producer agents to immediately pick up each entry from the CSV file and publish the flows on the "ssk_ddos_flow" topic. The next task performed by the 'SC-1' and 'SC-2' clusters is to select 23 significant variables from each flow. In [56], 24 significant variables are used to classify flows into different classes. However, among these 24 variables, Fwd_Header_Length and Fwd_Header_Length.1 appear to be duplicate columns. Further, the current version of CICFlowMeter no longer includes the Fwd_Header_Length.1 variable in the generated network flows. Therefore, we selected 23 variables from the variable list of each network flow.

The next job performed by both clusters is to bring the data values of the twenty-three variables onto the same scale. The data points are scaled with the "MinMax" technique provided by "sklearn.preprocessing".
Therefore, after the scaling process, data point values lie between 0 and 1. The mathematical formula for the scaling is:

$$x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}$$

where $x$ is a data value of a variable and $x_{min}$ and $x_{max}$ are the minimum and maximum values of that variable.

Both 'SC-1' and 'SC-2' formulate ten features from the 23 selected variables. This helps to enhance the accuracy and speed up the design process of the classification model. A summary of each feature is given in Table 1. After formulation, the features produced by 'SC-1' and 'SC-2' are replicated to 'SC-3'.

In this section, we present the distributed classification approach of the proposed SSK-DDoS for identifying various types of DDoS attacks. We divided this section into three sub-sections: details of the CICDDoS2019 dataset, the design of the classification model, and the post-deployment process.

The CICDDoS2019 [56] dataset is a collaborative project of the "Canadian Communications Security Establishment (CSE) and Canadian Institute for Cybersecurity (CIC)". It includes both benign and various types of DDoS attack scenarios. The dataset is available as both PCAP and CSV files, i.e., raw packets and labeled network flows, respectively. However, the CSV files have several issues. Therefore, we generated network flows from the PCAP files for the scenarios DDoS-UDP, DDoS-LDAP, DDoS-DNS, DDoS-SYN, DDoS-MSSQL, DDoS-NetBIOS, and Benign using the CICFlowMeter flow generator tool. The newly generated network flows contain 83 variables and one label column, which we have to update according to the attack-wise schedule of the PCAP files given on the dataset portal.

The step-by-step process to implement a distributed classification model for DDoS attacks using the MLlib library is shown in Fig. 5. However, the number of flows in each class is highly imbalanced, which affects the accuracy of the classification model. We up-sampled some classes to 5071011 flows. Therefore, the number of flows in the sample is approximately 35 million, and they are stored in HDFS. The next step is to implement a distributed classification model of DDoS attacks.
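The up-sampling of imbalanced classes described above can be sketched as follows. The text does not state which resampling method was used, so random oversampling with replacement is an assumption here, and the function name `upsample`, the `label` field, and the fixed seed are hypothetical choices made for illustration.

```python
import random
from collections import Counter


def upsample(flows, target, seed=0):
    """Randomly duplicate flows of under-represented classes up to `target` rows.

    Assumption: random oversampling with replacement; classes that already
    have at least `target` rows are left unchanged.
    """
    rng = random.Random(seed)
    by_class = {}
    for flow in flows:
        by_class.setdefault(flow["label"], []).append(flow)

    balanced = []
    for rows in by_class.values():
        balanced.extend(rows)
        shortfall = target - len(rows)
        if shortfall > 0:
            # Draw extra copies from the existing rows of this class.
            balanced.extend(rng.choices(rows, k=shortfall))
    return balanced


def class_counts(flows):
    """Count flows per class label."""
    return Counter(flow["label"] for flow in flows)
```

If all seven classes end up with 5071011 flows each, the balanced sample holds 7 × 5071011 = 35497077 flows, which is consistent with the figure of roughly 35 million reported in the text.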
We designed this classification model using Spark MLlib machine learning algorithms: DTC, MLR, NB, and RF. We then deploy this model on the Spark Streaming cluster. The next task is to calculate the performance evaluation metrics: precision, recall, and F1-score. The performance evaluation of these algorithms is discussed in Sect. 5. Finally, we save this model in persistent storage for deployment on the 'SC-3' Spark Streaming cluster to analyze flows in real-time.

The second part of the classification approach is to classify incoming network traffic into seven classes.

In this section, we explore the experimental setup of the proposed distributed SSK-DDoS classification system for DDoS attacks. It is shown in Fig. 7.

In this section, we evaluate the performance of our proposed SSK-DDoS classification system of DDoS attacks. The proposed SSK-DDoS classification system classifies network flows into seven classes. We considered two cases for the performance evaluation: case (I), while designing the classification model of DDoS attacks, and case (II), after deployment of this classification model on the DSPF, i.e., Spark Streaming. For this, we measure three performance evaluation metrics for multi-class classification (in this use case, seven target classes): precision ($P^{m}_{class}$), recall ($R^{m}_{class}$), and F1-score ($F1S^{m}_{class}$).

We designed and validated the proposed classification model using the CICDDoS2019 dataset. For the evaluation of case-I, the description of class-wise network flows is given in Table 2. We designed this model using four Spark MLlib machine learning algorithms: DTC, MLR, NB, and RF. We visualized the multi-class confusion matrices in Fig. 8 and the evaluation metrics in Table 3. In terms of accuracy, RF (89.05%) performs better than the other three, i.e., MLR (43.28%), NB (69.39%), and DTC (87.61%).
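The per-class precision, recall, and F1-score used in this evaluation can be computed from a multi-class confusion matrix. The sketch below assumes the standard one-vs-rest definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = harmonic mean of the two); the function name `per_class_metrics` and the example matrix layout are hypothetical, and the values in any usage are made up rather than taken from the paper's tables.

```python
def per_class_metrics(confusion, labels):
    """Return {label: (precision, recall, f1)} from a square confusion matrix.

    confusion[i][j] is the number of flows whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    size = len(labels)
    metrics = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        # Column sum minus the diagonal: flows wrongly predicted as this class.
        fp = sum(confusion[i][k] for i in range(size)) - tp
        # Row sum minus the diagonal: flows of this class predicted as another.
        fn = sum(confusion[k]) - tp
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics
```

Reporting the three values per class, rather than a single accuracy figure, makes it visible when one attack type is systematically confused with another.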
Further, we tuned the number of trees parameter ($T = 10, 20, 50$) for the RF algorithm. We found that RF gives better accuracy for $T = 50$ (89.05%) than for $T = 10$ (87.89%) and $T = 20$ (87.91%).

For the evaluation of case-II, we examined six scenarios with different combinations of the CICDDoS2019 dataset classes. The description of each scenario is presented in Table 4. After designing the classification model using various algorithms, the RF-based classification model ($T = 50$) gave better classification accuracy than the MLR, NB, RF ($T = 10$), RF ($T = 20$), and DTC algorithms. Therefore, we deployed the RF-based classification model ($T = 50$) on the 'SC-3' Spark Streaming cluster in the production environment. The performance evaluation of these six scenarios is given in Table 5, and their multi-class confusion matrices are visualized in Fig. 9. For case-II, the RF-based classification model ($T = 50$) provides the following accuracies: scenario-I: 99.44%, scenario-II: 87.09%, scenario-III: 91.04%, scenario-IV: 99.17%, scenario-V: 92.17%, and scenario-VI: 94.42%. From this, we conclude that the proposed classification model gives at least 87% accuracy even when attackers launch different types of attacks concurrently on the victim system.

Fig. 7 Testbed for the proposed SSK-DDoS classification system for DDoS attacks

In the case of the traditional framework-based DDoS attack detection mechanisms, each network flow is analyzed at a single point. Therefore, the time complexity of the system is $O(NNF)$, where $NNF$ is the number of network flows analyzed by the system [63]. However, in the case of a DPF/DSPF, the network flow analysis task is distributed between multiple nodes, and hence the complexity is also distributed over $n$ nodes (where $n$ is the number of nodes). To measure the complexity of the proposed system, we assume that each node examines an equal share of the network flows. Therefore, the complexity of the DPF/DSPF is $O\left(\frac{NNF}{n}\right)$.
In this case, we have to account for one more parameter, namely the intermediate communication cost between nodes. Let us assume the intermediate communication cost is $O(ICC)$. Therefore, the combined complexity cost (CCC) of the DPF/DSPF is $CCC = O\left(\frac{NNF}{n}\right) + O(ICC)$. However, a DPF/DSPF is specially designed to analyze a large amount of data, and hence $O(ICC)$ is negligible when compared with $O(NNF)$. Therefore, the CCC of the DPF/DSPF-based DDoS attack detection system is $O\left(\frac{NNF}{n}\right)$. This shows that the time complexity goes down as the number of nodes in the cluster increases.

In this section, we systematically compare the proposed SSK-DDoS classification system of DDoS attacks with existing DPF and traditional framework-based systems [34, 35, 37-39, 41-45, 47-49, 57] in Tables 6 and 7. Most of the DPF-based classification approaches [34, 35, 37-39, 44, 45, 47, 48] for DDoS attacks and legitimate traffic are deployed on the Apache Hadoop framework. This type of approach efficiently handles a large number of flows on a cluster of nodes. However, Apache Hadoop is mainly employed to examine large data in offline mode. Therefore, this type of classification approach is not capable of classifying network packets in real-time. A few authors [41-43, 49, 57] have proposed Apache Spark-based classification approaches for DDoS attacks and legitimate traffic. This type of approach examines network flows in near real-time. Further, these systems did not provide an automated way to take action on incoming traffic flows. In contrast, the proposed SSK-DDoS classification approach for DDoS attacks is not only designed on a DPF (using the Spark MLlib machine learning library on a Hadoop cluster) but also deployed on a DSPF (Spark Streaming). Therefore, the proposed system provides high scalability. Further, we used Kafka's distributed pub-sub messaging system, which gives the proposed SSK-DDoS classification system a loosely-coupled and automated design.
Sharafaldin et al. [56] have generated a realistic dataset by considering various attack scenarios. Further, they have proposed a detection approach to classify different types of DDoS attacks. According to their performance evaluation, the precision values for the classifiers ID3, RF, NB, and LR are 0.78, 0.77, 0.41, and 0.25, respectively, while our RF-based classification model gives a better precision value (0.89).

A distributed denial of service attack is one of the biggest threats to Internet-based services and their resources. It overwhelms victim resources in a short time by sending a large number of network packets. The traditional framework-based approaches themselves become victims of attacks while classifying a massive number of network flows. Further, most of the existing DPF-based classification systems for DDoS attacks were specially designed for offline mode and hence are not capable of classifying network flows in real-time. This paper proposed a Spark Streaming and Kafka-based distributed classification system for DDoS attacks, named SSK-DDoS. This classification approach is designed using the distributed Spark MLlib machine learning library on a Hadoop cluster and deployed on the Spark Streaming platform to classify network traffic in real-time into seven classes: Benign, DDoS-DNS, DDoS-LDAP, DDoS-MSSQL, DDoS-NetBIOS, DDoS-UDP, and DDoS-SYN. Further, this system stores the formulated features with the predicted class of each flow in HDFS for retraining the existing distributed classification model using a new set of samples. The proposed SSK-DDoS classification system has been validated using the recent CICDDoS2019 dataset. The results show that the proposed SSK-DDoS detection system classified network traffic into seven classes with an accuracy of 89.05%.

Conflict of interest The authors declared that they have no conflict of interest.
Lion IDS: a meta-heuristics approach to detect DDOS attacks against software-defined networks
Enhanced method of ANN based model for detection of DDoS attacks on multimedia Internet of Things
D-FACE: an anomaly based distributed approach for early detection of DDoS attacks and flash events
An anomaly based distributed detection system for DDoS attacks in Tier-2 ISP networks
DoS attacks Q4-2020
DDoS attacks Q3-2020
Analyzing BigData with Hadoop cluster in HDInsight azure Cloud
A full migration BBO algorithm with enhanced population quality bounds for multimodal biomedical image registration
A multi-phase blending method with incremental intensity for training detection networks
DRCDN: learning deep residual convolutional dehazing networks
MLFS-CCDE: multi-objective large-scale feature selection by cooperative coevolutionary differential evolution
Characterization of tor traffic using time based features
Distributed frameworks for detecting distributed denial of service attacks: a comprehensive review, challenges and future directions
A taxonomy of DDoS attack and DDoS defense mechanisms
A survey of defense mechanisms against distributed denial of service (DDoS) flooding attacks
Defense mechanisms against distributed denial of service attacks: a survey
Survey of network-based defense mechanisms countering the DoS or DDoS problems
Network anomaly detection: methods, systems and tools
DDoS attacks and defense mechanisms: classification and state-of-the-art
Network attacks: taxonomy, tools and systems
Distributed denial of service: taxonomies of attacks, tools and countermeasures
Distributed denial of service attacks and defense mechanisms: current landscape and future directions
A survey of distributed denial-of-service attack, prevention, and mitigation techniques
Characterization and comparison of DDoS attack tools and traffic generators: a review
ICMPv6-based DoS and DDoS attacks defense mechanisms
Survey on DDoS defense mechanisms
Detection and mitigation of DDoS attacks in SDN: a comprehensive review, research challenges and future directions
Detecting network cyber-attacks using an integrated statistical approach
A hybrid fog-cloud approach for securing the Internet of Things
Flow based anomaly intrusion detection system using ensemble classifier with feature impact scale
Detecting DDoS attacks with Hadoop
DOFUR: DDoS Forensics Using MapReduce
A neural-network based DDoS detection system using Hadoop and HBase
Secured network from distributed DoS through Hadoop
Efficacy of live DDoS detection with Hadoop
HADEC: a Hadoop based Live DDoS detection framework
Detection DDoS attacks based on neural-network using Apache Spark
DDoS attack detection system: utilizing classification algorithms with Apache Spark
DDoS detection system: utilizing gradient boosting algorithm and Apache Spark
DDoS attacks analysis in bigdata (Hadoop) environment
Faster detection and prediction of DDoS attacks using MapReduce and time series analysis
Hadoop-based analytic framework for cyber forensics
E-had: a distributed and collaborative detection framework for early detection of DDoS attacks
Apache hadoop based distributed denial of service detection framework
Real-time DDoS detection based on entropy using Hadoop framework
Apache Spark based real-time DDoS detection system
Detection of distributed denial of service attack using DLMN algorithm in hadoop
Detection of dns ddos attacks with random forest algorithm on spark
Detection of ddos attacks in openstack-based private cloud using apache spark
An intelligent and time-efficient DDoS identification framework for real-time enterprise networks
SAD-F: spark based anomaly detection framework
Distributed anomaly detection using concept drift detection based hybrid ensemble techniques in streamed network data
A feature reduction based reflected and exploited DDoS attacks detection system
Developing realistic distributed denial of service (DDoS) attack dataset and taxonomy
A DDoS attack detection system based on spark framework
Detection of HTTP flooding attacks in cloud using fuzzy bat clustering
D-FAC: a novel ϕ-divergence based distributed DDoS defense system
Smart detection: an online approach for DoS/DDoS attack detection using machine learning
A generalized machine learning-based model for the detection of DDoS attacks
A transparent and scalable anomaly-based DoS detection method
Modern Computer Arithmetic