key: cord-0057827-15g1wqfs
authors: Jemal, Ines; Haddar, Mohamed Amine; Cheikhrouhou, Omar; Mahfoudhi, Adel
title: ASCII Embedding: An Efficient Deep Learning Method for Web Attacks Detection
date: 2021-02-22
journal: Pattern Recognition and Artificial Intelligence
DOI: 10.1007/978-3-030-71804-6_21
sha: 3bf29350057f399bb7652beeafafc18921152d47
doc_id: 57827
cord_uid: 15g1wqfs
Web security is a combination of modern machinery and software technologies designed to protect the personal and confidential data of all Internet users. After decades of work on web security, the protection of personal data remains a major concern for legitimate Internet users. Nowadays, artificial intelligence techniques are overtaking classical signature-based and anomaly-based techniques, which are unable to detect zero-day attacks. To help reduce fraud and electronic theft on the server side, we propose in this paper a novel deep learning method to preprocess the input of neural networks. This technique, called ASCII Embedding, aims to efficiently detect web server attacks. Using the publicly available real dataset CSIC 2010, we evaluated our technique and compared it to existing ones such as the word and character embedding approaches. The experimental results show that our technique outperforms existing works in accuracy, reaching 98.2434%.
The World Wide Web has made our daily life easier. We can shop, pay our bills, and read the news over the Internet. However, this technology has also opened the door to electronic fraud. Web attacks persist even with more sophisticated security policies and standards. Skilled hackers can bypass security controls and steal sensitive data. To enhance web application security, several ready-made prevention tools are available, such as Web Application Firewalls (WAF) [1] and Intrusion Detection Systems (IDS) [2]. The majority of these products use anomaly-based and signature-based techniques.
These techniques, developed to secure the server, are unable to stop zero-day attacks. Nowadays, Artificial Intelligence (AI) [3] shows significant aptitude in many domains, particularly web security. Among the most popular AI techniques is machine learning [4]. The key concept of this technique is making the machine learn from existing data, so that after a while the machine becomes able to make the right decision. The learning phase relies on a huge amount of data that helps the machine learn how to react to each existing situation. After this phase, the machine makes decisions in new situations with an accuracy comparable to that of a human. A particularly suitable machine learning technique is the neural network. This technique mimics the human brain and, to some extent, makes the machine react intelligently to unknown circumstances. For web security purposes, the neural network learns from a huge number of HTTP requests stored in a chosen dataset. To reach a higher attack detection accuracy, the neural network input data must be carefully preprocessed from the raw dataset requests. The most used techniques in the literature to process the HTTP request are word embedding [5] and character embedding [6]. The first transforms the HTTP request word by word into a vector of numbers. The second does the same task at the character level. However, the main problem of these techniques is the unpredictable behavior of the neural network when it meets new characters or words. To tackle this problem, we propose in this paper a new HTTP request preprocessing technique called ASCII Embedding. This technique outperforms existing techniques by increasing neural network performance. A set of experiments is conducted to measure the performance of the new method. Experimentally, our technique performs better than the existing approaches, achieving 98.2434% accuracy.
The remainder of this paper is organized as follows. The second section presents the state of the art of neural network preprocessing techniques. Our method is presented in Sect. 3. Section 4 evaluates our method experimentally alongside the existing techniques, reports the results, and compares our technique to the existing ones. The last section concludes this paper and briefly outlines our future work.
Machine learning techniques, and especially deep learning techniques, have attracted enormous attention after their success in the image processing domain. To benefit from this success, many domains, such as web security, have applied these techniques. Many works have proposed deep learning techniques to detect web attacks, more specifically attacks on the server side. Mickiaki et al. [7] investigated a Character-Level Convolutional Neural Network (CLCNN) to detect web application attacks. Each character in the HTTP request was represented by an 8-bit numeric string and converted into a 128-dimensional vector expression. The authors used the CSIC 2010 [8] dataset to train their model and test their approach. They compared two different architectures using different kernel sizes. The accuracy varied between 95% and 98%. The same dataset was used by Zhang et al. [9], who split the HTTP request into words and deleted the non-alphanumeric characters. Their CNN model achieved 96.49% accuracy. The authors believed that the embedding vectors generated by the word embedding approach are more helpful for detecting web attacks, while Hung et al. [10] considered that word embedding has some limits: the HTTP request contains too many unique words, which runs into memory constraints, and in many cases an embedding vector cannot be obtained for new words at test time. For this reason, Hung et al.
[10] proposed an advanced word embedding approach to solve the problem of the large number of observed words. They applied the CNN to both characters and words to process the URL string. The authors used the area under the ROC curve (AUC) [11] to show the performance of their approach, which achieved 98%. Jane et al. [12] used two multilayer feed-forward neural networks: the first was trained on web application data, while the second was trained on user behavior. The input of the first neural network (ANN1) is a set of information collected from several sections of the HTTP request: Post parameters, Get parameters, Cookie parameters, database operations, file system operations, errors, and warnings. The second neural network (ANN2) takes data about user behavior as input: IP address, country, name, browser version, operating system version, system language, screen resolution, color depth, and browser home page. ANN1 and ANN2 achieved 92% and 95% accuracy, respectively. Joshua et al. [13] used the eXpose neural network, based on the character embedding approach, to detect malicious URLs; they investigated the automatic extraction of features from short strings. Their model achieved a 92% detection rate. Erxue et al. [14] used a CNN to improve IDS systems. The authors used the word embedding approach to detect malicious payloads, whereas our study focuses on detecting attacks on the server side. Their model achieved 95.75% accuracy. The works presented above used either word embedding or character embedding to process HTTP requests. To enhance the attack detection rate, we propose a new embedding approach called ASCII Embedding, which is presented in the next section.
The works presented in the literature review use word or character embedding techniques for HTTP request preprocessing. These two approaches are widely used for text processing.
The word embedding approach starts by removing the non-alphanumeric symbols from the HTTP request. Then, it maps each word to a vector of real values. Character embedding is an extension of word embedding to the character level. The works that used these techniques, especially word embedding, show acceptable accuracy. Nevertheless, when the HTTP request contains too many words or new words, the decision made is very likely to be incorrect.
We present in this section our new method, called ASCII Embedding. The new approach is an extension of character embedding to integers: each character is replaced by its machine code value (such as its ASCII code). The idea of using the ASCII code comes from the successful results obtained with CNN models for image classification, where the features are the pixels and every pixel has a value between 0 and 255. The HTTP request is thus transformed into a sequence of numbers, each of which represents the ASCII code of the corresponding character. We chose the convolutional neural network to test our technique because its special architecture can learn important features at a large scale. Our new preprocessing method aims to help the neural network efficiently detect malicious HTTP requests, even those containing new words not seen during the training phase. In our case, we are interested in the request sent from the client computer and received by the web server. Figure 1 describes the different parts of the HTTP request. Each part may contain a set of words, characters, and symbols. To avoid losing any information and to ensure a high detection rate, our method is applied to the whole HTTP request without deleting any word, character, or symbol. The preprocessing of the HTTP request consists of two successive steps. First, we split the HTTP request into a sequence of characters and symbols. Then, we replace each character or symbol with its integer value (the ASCII code).
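The two preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `ascii_encode` and `pad_or_truncate`, the sample request line, and the padding step (which the text does not describe) are all introduced here for illustration only.

```python
from typing import List


def ascii_encode(request: str) -> List[int]:
    """Map every character of an HTTP request to its integer code."""
    # Step 1: split the request into characters and symbols (nothing is dropped).
    chars = list(request)
    # Step 2: replace each character with its integer code. For ASCII
    # characters, ord() returns the ASCII code.
    return [ord(c) for c in chars]


def pad_or_truncate(codes: List[int], length: int, pad_value: int = 0) -> List[int]:
    """Hypothetical helper (an assumption): force a fixed vector length."""
    return codes[:length] + [pad_value] * max(0, length - len(codes))


# Made-up request line, used only for illustration.
sample = "GET /index.jsp?id=1 HTTP/1.1"
V = ascii_encode(sample)
print(V[:4])  # [71, 69, 84, 32] for "G", "E", "T", " "
```

Because every character has an integer code, a request containing words never seen during training still maps to a valid input vector, which is the property the method relies on.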
The result of these steps is a vector V of integers (see Fig. 2). The output V of the HTTP request preprocessing is the input of the CNN. Figure 3 presents the different layers and the hyper-parameters that shape our CNN model. The embedding layer of the CNN transforms V into a numeric matrix. The height l of this matrix equals the length of V, and its width d is the size of the embedding vector (d = 128). Based on the chosen kernel sizes (3, 4, 5, 6), the convolutional layer extracts the features that it considers important. The max-pooling layer then selects the strongest features from the outputs of the activation function (Relu). These layers reduce the matrix dimension and speed up the computation. Finally, the Softmax layer makes the binary decision and classifies the HTTP request into one of two classes. We use 1 to designate benign requests and 0 to designate malicious ones. The neural network architecture presented in Fig. 3 is implemented in order to evaluate our method. In the next section, we present the evaluation results and compare our technique to the existing works.
In this section, we evaluate our technique using two well-known splitting methods: train/test and k-fold cross-validation. We carried out the experiments with the open-access HTTP CSIC 2010 dataset [8]. It was developed at the "Information Security Institute" of CSIC (Spanish Research National Council) [8] and was generated to test web attack protection systems. It contains thousands of automatically generated HTTP requests: 36000 normal traffic requests and 25065 malicious traffic requests. This dataset is close to reality as it covers several types of anomalous requests, such as SQL injection, buffer overflow, information gathering, file disclosure, CRLF injection, Cross-Site Scripting (XSS), server-side include, parameter tampering, and unintentional illegal requests.
With the train/test method, we divide the entire dataset into three parts: the training set, the validation set, and the test set. The training and validation sets represent 80% of the complete CSIC 2010 dataset, and the remaining 20% is for the test. Table 1 shows the distribution of the CSIC 2010 dataset used to train, validate, and test the CNN. 63288 requests are used in the training phase and 19413 requests are used to test the efficiency of our method. The second experiment used the k-fold cross-validation method. The goal of this method is to avoid the under-fitting and over-fitting problems that can arise with the first method. This experiment aims to validate the results found using the first splitting method. We choose two cases, k = 5 and k = 10. The whole dataset is then divided into k parts and the experiments are repeated k times. Each time, k − 1 parts are used for training and one part for testing.
To start the experiments, we first choose the convolutional neural network hyper-parameters. For the CNN, we choose four different kernel sizes (k1, k2, k3, k4) = (3, 4, 5, 6) and n = 128 filters per kernel size. We choose Adam as the optimizer, cross-entropy as the loss function, and Relu as the activation function; the batch size is 64, the dropout is 0.5, and the embedding vector size d is 128. To evaluate the effectiveness of our technique, we used accuracy, recall, precision, and F1-Score as metrics. These metrics are based on the confusion matrix, which gives the values of TP, TN, FP, and FN.
Our experiments are conducted using the free cloud service Google Colaboratory [15] as platform and software services. It provides a powerful GPU and free deep learning software. Our experiments are conducted using the two data splitting methods explained above. The train/test results are presented in the next subsection.
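For reference, the four metrics named above follow from the confusion-matrix counts as shown in the sketch below. It uses the standard textbook definitions; the function name `classification_metrics` and the counts in the usage example are hypothetical and are not taken from the paper's tables. Which class is treated as "positive" is also an assumption that changes precision and recall.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to illustrate the formulas.
m = classification_metrics(tp=90, tn=80, fp=20, fn=10)
print(m["accuracy"])  # 0.85
```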
The k-fold cross-validation results will be presented later.
With 80% of the whole CSIC 2010 dataset, we trained our CNN model using the ASCII Embedding method for about 6000 steps (about 7 epochs). We recorded the training loss and accuracy after every 100 steps. In Fig. 4, we can observe that after 4500 steps of training, the training accuracy exceeds 98% while the training loss decreases rapidly towards zero. These accuracy and loss trends reveal the good performance of our CNN model in the training phase using our new method. After 6000 steps of training, we tested the newly trained CNN on the remaining 20% of the CSIC 2010 dataset. Table 2 presents the confusion matrix for the binary classification, with the values of TP, TN, FP, and FN. It demonstrates that ASCII Embedding performs well in the detection of malicious HTTP requests. Our CNN model based on ASCII Embedding correctly predicts 14282 of 14400 benign requests and 4767 of 5013 malicious requests. Table 3 presents the values of the different metrics obtained from the confusion matrix. Using our CNN model with the new method, the accuracy reached 98.125%, the precision 99.1805%, the recall 98.306%, and the F1-Score 96.284%. The next subsection presents the experiments that used the k-fold cross-validation data splitting method.
k-Fold Cross-validation Method. Using the train/test method for data splitting has some drawbacks. In some cases, either under-fitting or over-fitting can be observed. To avoid these problems, we use the k-fold cross-validation method. This method splits the dataset into k subsets, trains the model with k − 1 subsets, and holds out the last subset for the test. Consequently, the train/test phases are repeated k times. In the end, each subset has been used in both the training and testing phases.
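The k-fold procedure described above can be sketched as an index generator. This is a minimal illustration under assumptions: the function name `k_fold_indices`, the shuffling, and the fixed seed are introduced here and are not described in the text, and the toy dataset size is arbitrary.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k rounds."""
    indices = list(range(n_samples))
    # Shuffling with a fixed seed is an assumption made for reproducibility.
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]                 # held-out subset
        train_idx = indices[:start] + indices[start + size:]   # other k-1 subsets
        yield train_idx, test_idx
        start += size


# Toy run with 10 samples and k = 5: each round trains on 8 and tests on 2.
for round_no, (train_idx, test_idx) in enumerate(k_fold_indices(10, 5), start=1):
    print(round_no, len(train_idx), len(test_idx))
```

Each sample lands in exactly one test fold and in the training set of the other k − 1 rounds, which matches the property stated in the text.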
In these experiments, we use, in addition to the accuracy, the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) [11]. This measure is a well-established criterion for assessing neural network performance in binary classification problems. Figure 5 presents the performance evaluation of our CNN model using the ASCII Embedding method with 5-fold cross-validation. Figure 5(a) shows the accuracy obtained in each round for k = 5. The highest accuracy, 98.49%, was achieved in round 3, while the lowest, 97.85%, was obtained in round 2. The average accuracy is 98.2434%. Figure 5(b) presents the AUC-ROC results, with the false positive rate (FPR) on the x-axis and the true positive rate (TPR) on the y-axis. It shows that the AUC attains 0.97, which is very close to 1. Figure 6 shows the results obtained using 10-fold cross-validation. Figure 6(a) presents the accuracy found in each round of the experimental tests. The best accuracy, 98.44%, is reached in round 8, while the lowest accuracy, 97.95%, is obtained in rounds 7 and 9. The average accuracy is 98.2073%. As shown in Fig. 6(b), the AUC reaches 0.97. This value, close to one, confirms the considerable performance of the CNN model in detecting web attacks. The experiments in this section have two purposes. First, they confirm the efficiency of our technique in preprocessing the input of the CNN model. Second, they produce the results that will help us compare our technique to the existing ones. The next subsection addresses the second purpose.
In the previous subsection, we evaluated the discrimination ability of our new technique, ASCII Embedding. Table 4 summarizes the results achieved when testing the designed CNN using ASCII Embedding. It shows that the CNN model performs well in detecting web application attacks, with an accuracy exceeding 98%.
Under the same experimental conditions (dataset, hardware, software) and using the same CNN model (same hyper-parameter values), we implemented the word and character embedding techniques and compared them with our proposed ASCII Embedding. The word and character techniques did not exceed 97.6% and 96.12%, respectively, while our approach exceeds 98%. Table 5 presents the results obtained based on the attack detection accuracy criterion. Using the same CNN, our experiments showed that the ASCII Embedding approach achieved better accuracy, reaching 98.2434%. Across all the experiments described above on the CSIC 2010 dataset, ASCII Embedding was successfully used for web attack detection. It performs well in HTTP request classification (benign or malicious class) and achieves a high web attack detection rate with a low number of false alarms. Our technique preprocesses the neural network input in a simple way and yields better accuracy than the works using the word or character embedding approaches.
In this paper, we pushed the accuracy of neural networks for web attack detection a step further. We proposed a new method called ASCII Embedding to improve the detection of server-side web application attacks. Using a specially designed convolutional neural network model, our technique outperforms the existing ones and reached 98.2434%. When experimenting with ASCII Embedding, we noted an extra waiting time to get the results compared to the character and word embedding techniques. In future work, we will try to decrease this time overhead. Semantic processing of the HTTP request is a possible solution to this problem.
Web application protection techniques: a taxonomy
A survey on network intrusion detection system techniques
Artificial Intelligence: A Modern Approach
Machine learning and deep learning methods for cybersecurity
Word embeddings for natural language processing
Character-aware neural language models
Web application firewall using character-level convolutional neural network
A deep learning method to detect web attacks using a specially designed CNN
URLNet: Learning a URL representation with deep learning for malicious URL detection
The receiver operating characteristic (ROC) curve. Southwest Respir
Neural network approach to web application protection
Expose: a character-level convolutional neural network with embeddings for detecting malicious URLs, file paths and registry keys
TR-IDS: anomaly-based intrusion detection through text-convolutional neural network and random forest
Performance analysis of Google colaboratory as a tool for accelerating deep learning applications