key: cord-0043231-754q67o1
authors: Nguyen, Tuan; Le, Trung; Nguyen, Khanh; Vel, Olivier de; Montague, Paul; Grundy, John; Phung, Dinh
title: Deep Cost-Sensitive Kernel Machine for Binary Software Vulnerability Detection
date: 2020-04-17
journal: Advances in Knowledge Discovery and Data Mining
DOI: 10.1007/978-3-030-47436-2_13
sha: 158fa09fbea0154dbe707e4bf507117862bb4157
doc_id: 43231
cord_uid: 754q67o1

abstract: Owing to the sharp rise in the severity of the threats imposed by software vulnerabilities, software vulnerability detection has become an important concern in the software industry, such as the embedded systems industry, and in the field of computer security. Software vulnerability detection can be carried out at the source code or binary level. However, the latter is more impactful and practical since, when using commercial software, we usually only possess binary software. In this paper, we leverage deep learning and kernel methods to propose the Deep Cost-sensitive Kernel Machine, a method that inherits the advantages of deep learning methods in efficiently tackling structural data and of kernel methods in learning the characteristics of vulnerable binary examples with high generalization capacity. We conduct experiments on two real-world binary datasets. The experimental results show a convincing outperformance of our proposed method over the baselines.

Software vulnerabilities are specific flaws or oversights in a piece of software that can potentially allow attackers to exploit the code to perform malicious acts, including exposing or altering sensitive information, disrupting or destroying a system, or taking control of a computer system or program. Because of the ubiquity of computer software and the growth and diversity of its development process, a great deal of computer software potentially possesses software vulnerabilities. This makes the problem of software vulnerability detection an important concern in the software industry and in the field of computer security.

Software vulnerability detection consists of source code and binary code vulnerability detection. Due to the large loss of the syntactic and semantic information provided by high-level programming languages during the compilation process, binary code vulnerability detection is significantly more difficult than source code vulnerability detection. In addition, in practice, binary vulnerability detection is more applicable and impactful than source code vulnerability detection, because when using a commercial application we only possess its binary and usually do not possess its source code. The ability to detect the presence or absence of vulnerabilities in binary code, without having access to source code, is therefore of major importance in the context of computer security.

Some work has been proposed to detect vulnerabilities at the binary code level when source code is not available, notably work based on fuzzing, symbolic execution [1], or techniques using handcrafted features extracted from dynamic analysis [4]. Recently, the work of [10] pioneered the learning of automatic features for binary software vulnerability detection. In particular, this work was based on a Variational Auto-Encoder [7] to work out representations of binary software so that representations of vulnerable and non-vulnerable binaries are encouraged to be maximally different for vulnerability detection purposes, while still preserving crucial information inherent in the original binaries.
By nature, datasets for binary software vulnerability detection are typically imbalanced in the sense that the number of vulnerabilities is very small compared to the volume of non-vulnerable binary software. Another important trait of binary software vulnerability detection is that misclassifying vulnerable code as non-vulnerable is much more severe than many other misclassification decisions. In the literature, kernel methods in conjunction with the max-margin principle have shown their advantages in tackling imbalanced datasets in the context of anomaly and novelty detection [13, 18, 21]. The underlying idea is to employ the max-margin principle to learn the domain of normality, which is decomposed into a set of contours enclosing normal data that help distinguish normality from abnormality. However, kernel methods are not able to efficiently handle sequential machine instructions in binary software. In contrast, deep recurrent networks (e.g., recurrent neural networks or bidirectional recurrent neural networks) are very efficient and effective in tackling and exploiting temporal information in sequential data such as the sequential machine instructions in binary software. To cope with the difference in severity between the kinds of misclassification, cost-sensitive losses have been combined with kernel methods in previous work, notably [2, 5, 12]. However, these works either used non-decomposable losses or were solved in the dual form, which makes them difficult to combine with deep learning methods, in which stochastic gradient descent is employed to solve the corresponding optimization problem.

To smoothly enable the incorporation of kernel methods, cost-sensitive loss, and deep learning in the context of binary code vulnerability detection, we propose a novel Cost-sensitive Kernel Machine (CKM), which is developed based on the max-margin principle to find two optimal parallel hyperplanes and employs a cost-sensitive loss to find the best decision hyperplane. In particular, our CKM first aims to learn two parallel hyperplanes that can separate vulnerability and non-vulnerability, while the margin, defined as the distance between the two parallel hyperplanes, is maximized. The optimal decision hyperplane of CKM is then sought in the strip formed by the two parallel hyperplanes. To take into account the difference in severity between the two kinds of misclassification, we employ a cost-sensitive loss, where the misclassification of a vulnerability as a non-vulnerability is assigned a higher cost.

We conduct experiments on two datasets: the NDSS18 binary dataset, whose source code was collected and compiled to binaries in [10, 15], and a new dataset we created from binaries compiled from six open-source projects. We strengthen and extend the tool developed in [10] to allow it to handle more errors when compiling the source code of the six open-source projects into binaries. Our experimental results on these two binary datasets show that our proposed DCKM outperforms the baselines by a wide margin. The major contributions of our work are as follows:

- We upgrade the tool developed in [10] to create a new real-world binary dataset.
- We propose a novel Cost-sensitive Kernel Machine that takes into account both the difference in costs incurred by the different kinds of misclassification and the imbalanced nature of data in binary software vulnerability detection. This CKM can be plugged neatly into a deep learning model and be trained using backpropagation.
- We leverage deep learning, kernel methods, and a cost-sensitive approach to build a novel Deep Cost-sensitive Kernel Machine that outperforms state-of-the-art baselines on our experimental datasets by a wide margin.

By incorporating deep learning and kernel methods, we propose a Deep Cost-sensitive Kernel Machine (DCKM) for binary software vulnerability detection. In particular, we use a bidirectional recurrent neural network (BRNN) to summarize a sequence of machine instructions in binary software into a representation vector. This vector is then mapped into a Fourier random feature space via a finite-dimensional random feature map [9, 11, 14, 17, 19]. Our proposed Cost-sensitive Kernel Machine (CKM) is invoked in the random feature space to detect vulnerable binary software. Note that the Fourier random feature map, used in conjunction with our CKM and BRNN, enables our DCKM to be trained nicely via back-propagation.

Figure 1 presents an overview of the code data processing steps required to obtain the core parts of machine instructions from source code. From the source code repository, we identify the code functions and then fix any syntax errors using our automatic tool. The tool also invokes the gcc compiler to compile compilable functions into binaries. Subsequently, utilizing the objdump tool, we disassemble the binaries into assembly code; each function corresponds to an assembly code file. We then process the assembly code files to obtain a collection of machine instructions and eventually use the Capstone framework to extract their core parts. Each core part of a machine instruction consists of two components: the opcode and the operands, the latter called the instruction information (a sequence of bytes in hexadecimal format representing, e.g., memory locations and registers). The opcode indicates the type of operation, whilst the operands contain the necessary information for the corresponding operation.

Since both the opcode and the operands are important, we embed both the opcode and the instruction information into vectors and then concatenate them. To embed the opcode, we undertake some preliminary analysis and find that there are a few hundred opcodes in our datasets. We build a vocabulary of the opcodes and then embed them using one-hot vectors to obtain the opcode embedding e_op. To embed the instruction information, we first compute a frequency vector as follows: we consider the operands as a sequence of hexadecimal bytes (i.e., 00, 01, ..., FF) and count the frequencies of the hexadecimal bytes to obtain a frequency vector with 256 dimensions. The frequency vector is then multiplied by an embedding matrix to obtain the instruction information embedding e_ii. More specifically, the output embedding is e = [e_op || e_ii] (the concatenation of e_op and e_ii), where e_op = one-hot(op) × W_op and e_ii = freq(ii) × W_ii, with the opcode op, the instruction information ii, the one-hot vector one-hot(op), the frequency vector freq(ii), and the embedding matrices W_op ∈ R^{V×d} and W_ii ∈ R^{256×d}, where V is the vocabulary size of the opcodes and d is the embedding dimension. The process of embedding machine instructions is presented in Fig. 2.
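To make this embedding step concrete, the following NumPy sketch illustrates one way to realize it; the opcode vocabulary, matrix names, embedding dimension, and example instruction are illustrative assumptions rather than the authors' implementation.

import numpy as np

# Hypothetical setup: a tiny opcode vocabulary and embedding size d (both illustrative).
opcode_vocab = {"mov": 0, "push": 1, "call": 2, "add": 3}
V, d = len(opcode_vocab), 8

rng = np.random.default_rng(0)
W_op = rng.normal(size=(V, d))    # opcode embedding matrix, shape V x d
W_ii = rng.normal(size=(256, d))  # instruction-information embedding matrix, shape 256 x d

def embed_instruction(opcode, operand_bytes):
    # One-hot opcode embedding: e_op = one-hot(op) x W_op.
    one_hot = np.zeros(V)
    one_hot[opcode_vocab[opcode]] = 1.0
    e_op = one_hot @ W_op
    # Byte-frequency embedding over the 256 possible hexadecimal bytes:
    # e_ii = freq(ii) x W_ii.
    freq = np.bincount(np.asarray(operand_bytes, dtype=np.int64), minlength=256).astype(float)
    e_ii = freq @ W_ii
    # Concatenate the two embeddings to obtain the instruction embedding e.
    return np.concatenate([e_op, e_ii])

# Illustrative instruction: opcode "mov" with operand bytes 0x45, 0x00, 0x5d.
e = embed_instruction("mov", [0x45, 0x00, 0x5d])
print(e.shape)  # (16,), i.e., 2 * d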
We now present the general framework for our proposed Deep Cost-sensitive Kernel Machine. As shown in Fig. 3, given a binary x, we first embed its machine instructions into vectors (see Sect. 2.1); the resulting vectors are then fed to a Bidirectional RNN with sequence length L to work out the representation h, formed by concatenating the L-th hidden states (the left and right last hidden states) of the Bidirectional RNN. Finally, the vector representation h is mapped to a random feature space via a random feature map Φ(·) [19], where we recruit a cost-sensitive kernel machine (see Sect. 2.3) to classify vulnerable and non-vulnerable binary software. The formulation of Φ is as follows:

Φ(h) = (1/√D) [cos(ω_1ᵀ h), sin(ω_1ᵀ h), ..., cos(ω_Dᵀ h), sin(ω_Dᵀ h)],    (1)

where ω_1, ..., ω_D are the Fourier random elements as in [19], and the dimension of the random feature space is hence 2D. We note that the use of a random feature map in conjunction with the cost-sensitive kernel machine and the bidirectional RNN allows us to easily do back-propagation when training our Deep Cost-sensitive Kernel Machine.

In addition, let us denote the training set of binaries and their labels by D = {(x_1, y_1), ..., (x_N, y_N)}, where x_i is a binary consisting of many machine instructions and y_i ∈ {−1, 1}, with the label −1 standing for a vulnerable binary and the label 1 standing for a non-vulnerable binary. Assume that after feeding the binaries x_1, ..., x_N into the corresponding BRNN as described above, we obtain the representations h_1, ..., h_N. We then map these representations to the random feature space via the random feature map Φ(·) as defined in Eq. (1). We finally construct a cost-sensitive kernel machine (see Sect. 2.3) in the random feature space to help us distinguish vulnerability from non-vulnerability.

General Idea of the Cost-Sensitive Kernel Machine. We first find two parallel hyperplanes H_{-1} and H_1 in such a way that H_{-1} separates the non-vulnerable and vulnerable classes, H_1 separates the vulnerable and non-vulnerable classes, and the margin, which is the distance between the two parallel hyperplanes H_{-1} and H_1, is maximized. We then find the optimal decision hyperplane H_d by searching in the strip formed by H_{-1} and H_1 (see Fig. 4). Let us denote the equations of H_{-1} and H_1 by

H_{-1}: ⟨w, Φ(h)⟩ = b_{-1}   and   H_1: ⟨w, Φ(h)⟩ = b_1,   with b_1 > b_{-1}.

We arrive at the optimization problem of maximizing the margin (b_1 − b_{-1}) / ‖w‖ subject to the two hyperplanes separating the two classes. This problem is invariant to rescaling the solution by a factor k > 0 as (kw, kb_{-1}, kb_1). Therefore, we can safely assume that b_1 − b_{-1} = 1 (i.e., b_{-1} = a and b_1 = 1 + a for some offset a). Introducing non-negative slack variables ξ_i and a regularization parameter λ > 0, the primal form of the soft model optimization problem is hence of the following form:

min_{w, a, ξ}  (λ/2) ‖w‖² + (1/N) Σ_{i=1}^{N} ξ_i
s.t.  ⟨w, Φ(h_i)⟩ ≥ 1 + a − ξ_i  for y_i = −1,
      ⟨w, Φ(h_i)⟩ ≤ a + ξ_i  for y_i = 1,
      ξ_i ≥ 0,  i = 1, ..., N.    (2)

Finding the Optimal Decision Hyperplane. After solving the optimization problem in Eq. (2), we obtain the optimal solution w*, b*_{-1}, b*_1, where b*_{-1} = a* and b*_1 = 1 + a*, for the two parallel hyperplanes. Let us denote the strip S formed by the two parallel hyperplanes and the set of training examples I in this strip as

S = {u ∈ R^{2D} : b*_{-1} ≤ ⟨w*, u⟩ ≤ b*_1}   and   I = {i : Φ(h_i) ∈ S},

where u and the mapped representations v = Φ(h_i) lie in the random feature space R^{2D}.

Fig. 4. Cost-sensitive kernel machine in the random feature space. We first find two optimal parallel hyperplanes H_1 and H_{-1} with maximal margin and then search for the optimal decision hyperplane in the strip S formed by H_1 and H_{-1} to balance precision and recall, minimizing the cost-sensitive loss and obtaining a good F1 score.

As shown in Fig. 4, when sliding a hyperplane from H_1 to H_{-1}, the recall increases but the precision decreases. In contrast, when sliding a hyperplane from H_{-1} to H_1, the precision increases but the recall decreases. We hence desire to find the optimal decision hyperplane that balances precision and recall, minimizing the cost-sensitive loss while obtaining good F1 scores. We also conduct intensive experiments on real datasets to empirically demonstrate this intuition in Sect. 3.4.
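As an illustration of these components, the following Python sketch builds a random Fourier feature map of the form in Eq. (1) for the RBF kernel and identifies the examples falling inside the strip S; it is not the authors' implementation, and the kernel-width parameterization, variable names, and the placeholder solution (w*, a*) are assumptions made for the example.

import numpy as np

def make_random_feature_map(input_dim, D, gamma, seed=0):
    # Random Fourier feature map approximating the RBF kernel
    # K(x, x') = exp(-gamma * ||x - x'||^2); the output dimension is 2D.
    rng = np.random.default_rng(seed)
    # omega_1, ..., omega_D drawn from N(0, 2 * gamma * I) under this parameterization.
    omegas = rng.normal(scale=np.sqrt(2.0 * gamma), size=(D, input_dim))
    def phi(h):
        proj = h @ omegas.T  # shape (..., D)
        return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1) / np.sqrt(D)
    return phi

# Illustrative usage with hypothetical BRNN representations of dimension 128.
phi = make_random_feature_map(input_dim=128, D=256, gamma=2.0 ** -3)
H = np.random.randn(1000, 128)      # stand-in for the BRNN outputs h_1, ..., h_N
Z = phi(H)                          # mapped points in R^{2D}, here 2D = 512

# Given a learned solution (w*, a*) of Eq. (2) (placeholders here), the strip S is
# a* <= <w*, z> <= 1 + a*, and I collects the indices of the examples inside it.
w_star, a_star = np.random.randn(Z.shape[1]), 0.0
scores = Z @ w_star
in_strip = (scores >= a_star) & (scores <= 1.0 + a_star)
strip_indices = np.where(in_strip)[0]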
Inspired by this observation, we seek the optimal decision hyperplane H_d by minimizing the cost-sensitive loss over the training examples inside the strip S, where we treat the two kinds of misclassification unequally. In particular, the cost of misclassifying a non-vulnerability as a vulnerability is 1, while the cost of misclassifying a vulnerability as a non-vulnerability is θ. The value of θ, the relative cost between the two kinds of misclassification, is set depending on the specific application. In this application, we set θ = #non-vul : #vul >> 1, which makes sense because, in binary software vulnerability detection, the cost suffered by classifying vulnerable binary code as non-vulnerable is, in general, much more severe than the converse.

Let |I| = M, where |·| denotes the cardinality of a set, and arrange the elements of I according to their distances to H_{-1} as I = {i_1, i_2, ..., i_M}. We now define the cost-sensitive loss for a given decision hyperplane H_d: ⟨w*, u⟩ = b_d with b*_{-1} ≤ b_d ≤ b*_1, together with the optimal decision hyperplane (w*, b*_d), as

loss(b_d) = Σ_{i ∈ I} ( θ I[y_i = −1 and ⟨w*, Φ(h_i)⟩ ≤ b_d] + I[y_i = 1 and ⟨w*, Φ(h_i)⟩ > b_d] ),
b*_d = argmin_{b_d} loss(b_d),

where the indicator function I[S] returns 1 if S is true and 0 otherwise. It is worth noting that if #non-vul ≈ #vul (i.e., θ ≈ 1), we obtain a Support Vector Machine [3], and if #non-vul >> #vul (i.e., θ >> 1), we obtain a One-class Support Vector Machine [21].

We present Algorithm 1 to efficiently find the optimal decision hyperplane. The general idea is to sequentially process the M examples in the strip in increasing order of their distance to H_{-1}: starting from H_{-1}, the candidate decision hyperplane slides past one example at a time and the cost-sensitive loss is updated incrementally (passing a vulnerable example adds θ to the loss, since it becomes a false negative, while passing a non-vulnerable example subtracts 1, since a false positive is removed); whenever the current loss is smaller than the best loss found so far (if loss < minLoss then minLoss = loss; m* = t), the position t is recorded, and the procedure stops when t > M + 1.
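The following Python sketch illustrates this search; it reflects our reading of Algorithm 1 rather than the authors' code, and the function name, the tie-breaking choice of threshold, and the incremental update are assumptions for the example.

import numpy as np

def find_decision_offset(scores, labels, a_star, theta):
    # Slide a candidate decision hyperplane across the strip [a*, 1 + a*] and
    # return the offset b_d with the smallest cost-sensitive loss.
    # scores: <w*, Phi(h_i)> for all training examples; labels: -1 (vulnerable) or 1.
    b_lo, b_hi = a_star, 1.0 + a_star
    in_strip = (scores >= b_lo) & (scores <= b_hi)
    order = np.argsort(scores[in_strip])        # increasing distance to H_{-1}
    strip_scores = scores[in_strip][order]
    strip_labels = labels[in_strip][order]

    # Decision hyperplane at H_{-1}: every example in the strip is predicted
    # vulnerable, so the initial loss counts each non-vulnerable example once.
    loss = float(np.sum(strip_labels == 1))
    best_loss, best_b = loss, b_lo

    for t in range(len(strip_labels)):
        # Sliding past example i_t: a vulnerable example becomes a false
        # negative (+theta); a non-vulnerable one stops being a false positive (-1).
        loss += theta if strip_labels[t] == -1 else -1.0
        if t + 1 < len(strip_labels):
            candidate_b = 0.5 * (strip_scores[t] + strip_scores[t + 1])
        else:
            candidate_b = b_hi
        if loss < best_loss:
            best_loss, best_b = loss, candidate_b
    return best_b, best_loss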
Creating labeled binary datasets for binary code vulnerability detection is one of the main contributions of our work. We first collected the source code from two datasets on GitHub: NDSS18 and the six open-source projects collected in [16], and then processed them to create two labeled binary datasets. The NDSS18 binary dataset was created in previous work [10]: the functions were extracted from the original source code and then compiled successfully to obtain 8,991 binaries using an automated tool. However, the source code in the NDSS18 dataset involves the code weaknesses CWE119 and CWE399, resulting in short source code chunks used to demonstrate the vulnerable examples, hence not perfectly reflecting real-world source code, while the source code files collected from the six open-source projects, namely FFmpeg, LibTIFF, LibPNG, VLC, Pidgin, and Asterisk, are all real-world examples. The statistics of our binary datasets are given in Table 1.

We compared our proposed DCKM with various baselines:
- BRNN-C, BRNN-D: A vanilla Bidirectional RNN with, respectively, a linear classifier and two dense layers on top.
- Para2Vec: The paragraph-to-vector distributional similarity model proposed in [8], which allows us to embed paragraphs into a vector space; the resulting vectors are further classified using a neural network.
- VDiscover: An approach proposed in [4] that utilizes lightweight static features to "approximate" a code structure to seek similarities between program slices.
- VulDeePecker: An approach proposed in [15] for source code vulnerability detection.
- BRNN-SVM: A Support Vector Machine using the linear kernel, but leveraging our proposed feature extraction method.
- Att-BGRU: An approach developed in [22] for sequence classification using the attention mechanism.
- Text CNN: An approach proposed in [6] using a Convolutional Neural Network (CNN) to classify text.
- MDSAE: The Maximal Divergence Sequential Auto-Encoder method proposed in [10] for binary software vulnerability detection.
- OC-DeepSVDD: The One-class Deep Support Vector Data Description method proposed in [20].

The implementation of our model and the binary datasets for reproducing the experimental results can be found online at https://github.com/tuanrpt/DCKM.

For our datasets, we split the data into 80% for training, 10% for validation, and the remaining 10% for testing. For the NDSS18 binary dataset, since it was built for the purpose of demonstrating the presence of vulnerabilities, each vulnerable source code file is associated with its fixed version, hence this dataset is quite balanced. To mimic a real-world scenario, we made this dataset imbalanced by randomly removing vulnerable source code to keep the ratio #vul : #non-vul = 1 : 50. For the dataset from the six open-source projects, we did not modify the data since it is real-world data.

We employed a dynamic BRNN to tackle the variation in the number of machine instructions across functions. For the BRNN baselines and our models, the size of the hidden unit was set to 128 for the binary dataset from the six open-source projects and 256 for the NDSS18 dataset. For our model, we used the Fourier random feature map with 2D = 512 random features to approximate the RBF kernel, defined as K(x, x') = exp(−γ‖x − x'‖²), wherein the kernel width γ was set to 2^{−15} and 2^{−3} for the dataset from the six open-source projects and the NDSS18 dataset, respectively. The regularization parameter λ was 0.01. We set the relative cost θ = #non-vul : #vul. We used the Adam optimizer with an initial learning rate of 0.0005. The minibatch size was set to 64, and results became promising after 100 training epochs. We implemented our proposed method in Python using TensorFlow, an open-source software library for Machine Intelligence developed by the Google Brain Team. We ran our experiments on a computer with an Intel Xeon Processor E5-1660, which has 8 cores at 3.0 GHz, and 128 GB of RAM. For each dataset and method, we ran the experiment five times and reported the average predictive performance.

Experimental Results on the Binary Datasets. We conducted a variety of experiments on our two binary datasets. We split each dataset into three parts: the subset of Windows binaries, the subset of Linux binaries, and the whole set of binaries, to compare our methods with the baselines. In the field of computer security, besides the AUC and F1 score, which take into account both precision and recall, the cost-sensitive loss, wherein we consider the fact that the misclassification of a vulnerability as a non-vulnerability is more severe than the converse, is also very important. The experimental results on the two datasets are shown in Tables 2 and 3. It can be seen that our proposed method outperforms the baselines in all performance measures of interest, including the cost-sensitive loss, F1 score, and AUC. In particular, our method significantly surpasses the baselines on the AUC score, one of the most important measures of success for anomaly detection. In addition, although our proposed DCKM aims to directly minimize the cost-sensitive loss, it can balance precision and recall to maintain very good F1 and AUC scores. In what follows, we further explain this claim.
Discovering the trend of scores and the number of data points in the strip during the training process. Fig. 5 shows the predictive scores and the number of data examples in the parallel strip on the training and validation sets for the binary dataset from the six open-source projects across the training process. It can be observed that the model gradually improves during training, with an increase in the predictive scores and a reduction in the amount of data in the strip from around 1,700 to 50.

The tendency of predictive scores when sliding the decision hyperplane in the strip formed by H_{-1} and H_1. By minimizing the cost-sensitive loss, we aim to find the optimal hyperplane which balances precision and recall, while at the same time maintaining good F1 and AUC scores. Figure 6 shows the tendency of the scores and the cost-sensitive loss when sliding the decision hyperplane in the strip formed by H_{-1} and H_1. We especially focus on four milestone hyperplanes, namely H_{-1}, H_1, the hyperplane that leads to the optimal F1 score, and the hyperplane that leads to the optimal cost-sensitive loss (i.e., our optimal decision hyperplane). As shown in Fig. 6, our optimal decision hyperplane, marked with the red stars, achieves the minimal cost-sensitive loss while maintaining F1 and AUC scores comparable to those of the optimal-F1 hyperplane, marked with the purple stars.

Binary software vulnerability detection has emerged as an important problem in the software industry, such as the embedded systems industry, and in the field of computer security. In this paper, we have leveraged deep learning and kernel methods to propose the Deep Cost-sensitive Kernel Machine for tackling binary software vulnerability detection. Our proposed method inherits the advantages of deep learning methods in efficiently tackling structural data and of kernel methods in learning the characteristics of vulnerable binary examples with high generalization capacity. We conducted experiments on two binary datasets. The experimental results have shown a convincing outperformance of our proposed method compared to the state-of-the-art baselines.

[1] Symbolic execution for software testing: three decades later
[2] An optimized cost-sensitive SVM for imbalanced data learning
[3] Support-vector networks
[4] Toward large-scale vulnerability discovery using machine learning
[5] Robust cost sensitive support vector machine
[6] Convolutional neural networks for sentence classification
[7] Auto-encoding variational Bayes
[8] Distributed representations of sentences and documents
[9] GoGP: fast online regression with Gaussian processes
[10] Maximal divergence sequential autoencoder for binary software vulnerability detection
[11] Dual space gradient descent for online learning
[12] Robust support vector machine
[13] A unified model for support vector machine and support vector data description
[14] GoGP: scalable geometric-based Gaussian process for online regression
[15] VulDeePecker: a deep learning-based system for vulnerability detection
[16] Cross-project transfer representation learning for vulnerable function discovery
[17] Large-scale online kernel learning with random feature reparameterization
[18] Kernel-based semi-supervised learning for novelty detection
[19] Random features for large-scale kernel machines
[20] Deep one-class classification
[21] Estimating the support of a high-dimensional distribution
[22] Attention-based bidirectional long short-term memory networks for relation classification

This research was supported under the Defence Science and Technology Group's Next Generation Technologies Program.