key: cord-0609349-86lylgic authors: Berger, Harel; Hajaj, Chen; Mariconti, Enrico; Dvir, Amit title: MaMaDroid2.0 -- The Holes of Control Flow Graphs date: 2022-02-28 journal: nan DOI: nan sha: b77053a3cc4d69abf6efb907d073b531ed89899a doc_id: 609349 cord_uid: 86lylgic Android malware is a continuously expanding threat to billions of mobile users around the globe. Detection systems are updated constantly to address these threats. However, a backlash takes the form of evasion attacks, in which an adversary changes malicious samples such that those samples will be misclassified as benign. This paper fully inspects a well-known Android malware detection system, MaMaDroid, which analyzes the control flow graph of the application. Changes to the portion of benign samples in the train set and models are considered to see their effect on the classifier. The changes in the ratio between benign and malicious samples have a clear effect on each one of the models, resulting in a decrease of more than 40% in their detection rate. Moreover, adopted ML models are implemented as well, including 5-NN, Decision Tree, and Adaboost. Exploration of the six models reveals a typical behavior in different cases, of tree-based models and distance-based models. Moreover, three novel attacks that manipulate the CFG and their detection rates are described for each one of the targeted models. The attacks decrease the detection rate of most of the models to 0%, with regards to different ratios of benign to malicious apps. As a result, a new version of MaMaDroid is engineered. This model fuses the CFG of the app and static analysis of features of the app. This improved model is proved to be robust against evasion attacks targeting both CFG-based models and static analysis models, achieving a detection rate of more than 90% against each one of the attacks. Abstract-Android malware is a continuously expanding threat to billions of mobile users around the globe. Detection systems are updated constantly to address these threats. However, a backlash takes the form of evasion attacks, in which an adversary changes malicious samples such that those samples will be misclassified as benign. This paper fully inspects a well-known Android malware detection system, MaMaDroid, which analyzes the control flow graph of the application. Changes to the portion of benign samples in the train set and models are considered to see their effect on the classifier. The changes in the ratio between benign and malicious samples have a clear effect on each one of the models, resulting in a decrease of more than 40% in their detection rate. Moreover, adopted ML models are implemented as well, including 5-NN, Decision Tree, and Adaboost. Exploration of the six models reveals a typical behavior in different cases, of tree-based models and distance-based models. Moreover, three novel attacks that manipulate the CFG and their detection rates are described for each one of the targeted models. The attacks decrease the detection rate of most of the models to 0%, with regards to different ratios of benign to malicious apps. As a result, a new version of MaMaDroid is engineered. This model fuses the CFG of the app and static analysis of features of the app. This improved model is proved to be robust against evasion attacks targeting both CFG-based models and static analysis models, achieving a detection rate of more than 90% against each one of the attacks. Malicious software, a.k.a malware, is defined as a file or program that tries to damage the normal activity of a digital device. Most malware are specifically engineered to the operating system (OS) they target, as OSs tend to vary based on their hardware and functionalities. One of the popular targets of malware is the Android OS. Android application Packages (APKs), the popular executable files of Android OS, can be found in many Android markets around the world (e.g., Google play store [1] ). Many malware apps can be found on these markets that target the attention of unsuspecting users to assure downloading of the malicious app. For example, the Etinu malware leverages the fear of the recent COVID-19 in Asia [2] . This malware stole information from incoming SMS messages, made purchases in the victim's name, and infected more than 700K users. As a safeguard for these users and many more, a vast amount of researchers and cyber experts are looking for a satisfying solution for the correct identification of malware applications [3] , [4] , [5] , [6] , [7] , [8] , [9] , [10] , [11] , [12] , [13] , [14] , [15] , [16] , [17] , [18] , [19] , [20] , [21] . Several studies were conducted throughout the years to mitigate the threat of Android malware, implementing methods like using heuristics of app structure and signatures [17] , [18] , permissions' analysis [10] , [19] , [21] , [16] , [15] . Hitherto, the most popular approaches to Android malware detection are Machine Learning and Deep Learning [3] , [4] , [11] , [13] , [5] . A strong ML classifier is based, among others, on the raw data that must represent the domain correctly. Specifically, for different sub-domains of malware detection, there is a different real-world distribution of malicious and benign samples. For example, in the subdomain of URL malware detection, most Internet addresses are considered malicious [22] , [23] . In the paper domain, Android malware, most of the apps are considered benign [24] , [25] , [26] . A recent and popular study defined the advisable ratio as 90/10 between benign and malicious Android apps [26] . As a result, the challenge for the Android malware classifiers is to analyze the important characteristics of malicious apps, despite the low volume of malicious apps in the dataset. Even an ML classifier trained on the right amount of benign and malicious apps is not 100% accurate on any input sample. It was proven by Goodfellow et al. [27] that some ML classifiers are susceptible to manipulations. These manipulations are called adversarial examples. Adversarial examples are created when an attacker manipulates malicious samples so that the sample will be misclassified as benign and vice versa [28] , [29] , [30] . In turn, evasion attacks are intelligent attacks in which the adversary manipulates malicious instances, such that they will be wrongly classified. Evasion attacks that change the physical malicious instance are called problem-based evasion attacks. On the other hand, evasion attacks that manipulate the extracted feature vector of an instance are called feature-based evasion attacks. To address the threats that arise from evasion attacks on APKs, this work follows one of the well-known Android malware detection systems, MaMaDroid [11] . MaMaDroid is based on Control Flow Graph (CFG) [31] , where a CFG is a representation of a program using graph notation. This representation depicts all paths that might be traversed through the program while executing it. The contribution of this paper is threefold. First, a full evaluation of the portion of instances on the training set that is benign on the classifier's detection rate is presented. Second, three innovative problem-based evasion attacks against Ma-MaDroid are introduced. The evasion attacks are thoroughly evaluated based on multiple metrics (e.g., robustness), and a varied set of models. Finally, insights from this analysis motivated the creation of a new version of MaMaDroid, MaMaDroid2.0. The new version incorporates an extended feature set. MaMaDroid2.0 was evaluated against our novel evasion attacks and other known evasion attacks. Compared to the original MamaDroid model [11] which results in an evasion robustness of less than 30%, the new model results in an evasion robustness of more than 90%. The remainder of this paper is organized as follows. First, related work is presented and discussed in Section II. Ma-MaDroid, the targeted system, is presented in Section III, together with the attacker models. In Section IV, the attacks are presented. Next, in Section V, metrics and evaluations are discussed. The new version of MaMaDroid, MaMaDroid2.0, is explored in Section VI. A discussion about the evasion attacks and mitigation techniques can be found in Section VII, along with the conclusion of the paper. Table I is an abbreviations table. II. RELATED WORK In Section II-A, four main approaches in the field of Android malware detection systems are described. Additionally, evasion attacks leverage weak spots in ML-based Android malware detection systems. Evasion attacks take two forms: problem-based attacks and feature-based attacks. Problembased attacks [32] , [33] , [34] , [35] , [36] , [37] , [38] include modification of the samples. This is the type of attack implemented in this study. Feature-based attacks [39] , [40] , [41] , [42] , [43] , [44] , [45] map the sample into a feature vector and modify the values of the features. Feature-based attacks are easier to implement than problem-based attacks. The feature-based attacks can be generated automatically by an ML [46] , [47] , [48] , [49] , [50] . Still, the correlated code in the sample that needs to be changed according to these attacks may severely damage the functionality of the sample [51] , [52] , [53] , [54] . Therefore, feature-based evasion attacks are less realizable. This study implements problem-based evasion attacks. Therefore, a review of significant works on problembased evasion attacks on Android malware detection systems is presented in Section II-B. This section surveys well-known Android malware detection systems following four main approaches. The first approach is static analysis, which gathers significant strings from the Smali code files and Manifest file. The second strategy is based on the control flow graph (CFG) of the application, which traces the order of the API calls used in the app. The behavior of the app is inspected in the third approach, which gathers information on the behavior of the Android OS during the run of the app, along with network packets sent and received, etc. The last approach analyses bytecode sequences. Several Hybrid systems are discussed as well. Probably one of the honored Android malware detection systems using static analysis is Drebin [4] . Drebin gathers 8 types of static features, with a specification of two main components of the APKs. Drebin extracts permissions requests, software/hardware components, and intents from the Manifest file. From the Smali code, it extracts suspicious/restricted API calls, used permissions in the app's run and URL addresses. Sec-SVM [35] is an improved version of Drebin, using a more evenly weighted learning model. Other versions of Drebin include a Factorization Machine [55] and a DNN [56] . The DroidAPIMiner [3] , [57] is similar to Drebin, in means of static analysis and feature types. The API calls and permissions are the feature set of this detection machine. The authors of this research performed a dataflow analysis for frequently used API values and package names. The static analysis detection systems look for several features from the APK. Enumeration of these features is easy and efficient. However, since most of the static features can be easily understood, they can also be automatically manipulated without many efforts, like in [35] , [58] , [59] , [37] . A more robust approach is found in MaMaDroid [11] , which extracts the Control Flow Graph (CFG) of an application as a base for its feature set. MaMaDroid generates a tree of API calls based on family and package names. Then, the detection system analyzes the API call sequence performed by an app by each mode -family or package, to model its true nature. A similar approach to detect malicious third-party use in apps was used by Backes et al. [60] . Function and app profiling, such as types and parameters, were used to identify third-party libraries and different versions of the same library. Zhiwu et al. [61] used CFG along with data flow to characterize Android apps, along with a CNN model to predict the labels of new samples. An extension was suggested in [62] , where the same technique was used using n-grams to identify malware families. Monitoring the sequence of functions and API calls inside the app seems like a great idea. Changing the flow of the app is more complicated, since the CFG may be complex and full of details. Furthermore, changing the order of API calls of the app may damage the malicious activity of the app. However, several evasion attacks manipulate the flow of the app and succeed in deceiving this kind of detection system, such as [51] , [63] , [58] , [64] , [65] . The third approach tries to map the traces of a malicious app, thus acknowledging the behavior of such apps, using several operating systems and communication features, as was introduced in Andromaly [12] . A similar approach was taken by Shabtai et al. [14] . These detection systems focused on network usage by measuring the network traffic patterns in a host device running an app. The authors learned the statistics of network packets a user sent and received, the RTT, etc. Another research using a similar strategy is Shabtai et al. [66] . The authors aggregate data during an app run, including user interactions like clicks and touches and OS behavior such as the CPU and network usage. Saracino et al. [67] found the correlations between features at four levels: kernel, application, user and package, to identify malicious applications. Wang et al. [68] explored the number of pages on virtual machines, the change of states between tasks, etc. Behavioral detection methods are based on the nature of the Android device while running various apps. These behaviors may depend on the app and are therefore hard to generalize. Therefore, they are not a complete solution to malware detection. Bytecode inspection is the last approach to Android malware detection. Dalvik Bytecode Frequency Analysis [69] is one example, which looks for popular Dalvik Bytecode instructions of malware apps. Sequences of Dalvik Bytecode instructions were also explored by TinyDroid [70] . The authors of this system gathered families of Bytecode instructions under a single symbol and used n-grams [71] to create the feature set. Yang et al. [72] extracted the bytecode file from the APK and converted the Bytecode to a matrix. This matrix was analyzed by a CNN. Bytecode inspection is a heavy method and ambiguous for the human eye, as it is not a convenient programming language. A hybrid detection system was suggested by Martín et al. [73] . This system fused the static and dynamic analysis of Android apps. The authors combined the transitions between execution states (dynamic) and the inspection of API calls (static). A combination of static analysis of permission requests and dynamic testing of their derived API calls was introduced in [74] . MARVIN [75] inspects the nature of Android apps through static analysis of their design, certificates, etc. In addition, a dynamic analysis of behavioral activity is processed. The combination of static analysis of permissions and intents and dynamic analysis of network traffic was introduced by Ding et al. [76] . A hybrid solution seems the best option to identify malicious apps. However, each method that is added as another layer of processing consumes time and resources from the host device. This section targets the problem-based evasion attacks against malware detection systems. Problem-based evasion attacks are categorized into three forms of attacks. The first uses camouflage to conceal incriminating strings and values contained in the app, by encryption and obfuscation. Next, the second incorporates noises to the app; e.g., uncalled functions. At last, the third form tries to alter the behavior of the app. It reviews the flow of the original app and manipulates the code of several function calls. The first course of action in evasion attacks is to conceal specific suspicious components of the app. One well-known example of a concealment effort is with the help of encryption or obfuscation. Demontis et al. [35] obfuscated suspicious strings, packages, and API calls. Another example of concealment is packing an app inside a fellow app. DaDidroid [63] explored a similar approach as Demontis et al. [35] using packing and obfuscation. Reflection allows a program to change its nature at runtime. It is another classic evasion technique. Rastogi et al. [37] presented an attack, which mixes the Demontis et al. [35] obfuscation method with the addition of the reflection approach. Another form of problem-based evasion attack includes adding noise to the app. These noises mislead the classifier's labeling process. An example of noise addition can be a stub function/code injection. A stub function is a non-operational function, that does not do anything. However, it changes the original flow of an app. An example of a stub function addition is Android HIV [51] , where the authors implemented noninvoked suspicious functions against the Drebin classifier and a stub function injection against the MaMaDroid classifier. A recent example of this kind of attack is at [52] , where the authors implanted benign snippets of code in malicious apps to evade the Drebin and Sec-SVM [35] classifiers. Rosenberg et al. [77] generated an evasion attack against Android malware detection systems using API call manipulation. Three methods were used: Addition of non-operational functions to the application, obfuscation of strings, and encoding of short API calls. Cara et al. [78] added non-invoked classes to the end of functions to a. The last approach is changing the app flow. One of the ways to implement this approach is by function outlining/inlining. In function outlining, the attacker breaks a function into smaller code snippets. In function inlining, the adversary replaces a function call with the entire function body. This technique was implemented in Droidchameleon [37] , which incorporated function outlining in its evasion attacks. Another option to break the app flow is stub function, as in [51] , [64] , [65] , resulting in an ML misclassification of a malicious app. In this section, the targeted system MaMaDroid [11] , [79] is presented, followed by the attacker models. MaMaDroid is an Android malware detection system, introduced in 2017 by Onwuzurike et al. [11] . This detection system extracts features from the Control Flow Graph (CFG) of an APK sample. It enumerates abstracted API calls to capture the behavioral model of the app. MaMaDroid operates in two modes: family and package mode. For example, the API call android.util.Log->d() is abstracted as android. in the family mode, and android.util in the package mode. Packages or families, which are defined by the app's developer or obfuscated are abstracted as self-defined and obfuscated, respectively. As the structure of the evasion attack is similar for both modes, and the family mode results in lower processing time and memory, the family mode was chosen for this analysis. The structure of the evasion attack is similar for both modes. MaMadroid creates the features for the learning algorithm in the following manner: First, it extracts the CFG from the app. Then, it gets the sequences of API calls. Next, the APIs are abstracted using one of the modes. At last, it constructs a Markov chain [80] , [81] , [82] , with the probabilities of transition between any family/package. These probabilities are used as the features. For example, androidTojava is the feature that resembles the probability of transition between the android family to the java family. For a full description of the MaMaDroid classifier, see [11] (implementation is available at [83] ). This section describes two attacker models, which depict the embedded knowledge each attack holds. The first model, named as Feature set Access (FA), depicts a gray-box attacker that knows the feature set of the targeted system (as the attacker model in [84] and attack scenario F in [51] ). The second model is the Statistics Access (SA), which is a whitebox attacker, and can access the feature set and the training data (as attack scenario FB in [51] ). Based on the attacker models, a set of evasion attacks is engineered that transfers the embedded knowledge of the defense model to a manipulated malicious APK that will be classified as benign. The idea behind these attacks is to break and change the structure of the sample so that the detection system would not recognize the manipulated samples as malicious. The three evasion attacks described in this section are variants of a general attack that will be termed the Structure Break attack. However, before the description of the Structure Break (StB) general algorithm, an explanation of the mode elements is provided. These items are vital to each one of the attacks, and specifically engineered for each one of them according to their attacker model. The mode elements are discussed in Section IV-A. Then, the Structure Break (StB) general algorithm is described in Section IV-B. At last, the variants are described in Section IV-C. Mode elements are a subset of the feature set of the specific mode MaMaDroid analyzes. For example, for the family mode, android., java. and xml. are a part of the feature set of the family mode and therefore can be picked for the Mode elements of the family mode. In this work, the mode elements are engineered in three ways: 1) Randomly -Randomly picking a subset of the feature set. Specifically, several families like android. java. and xml can be picked from the family mode feature set. 2) Statistically -Using the statistics of the given data. Given data can be the train and test sets, or just the test set. The given data is analyzed, to produce statistics on the elements. If the training data is given, the elements that will be picked are the ones that hold high values in the benign data, and low values in the malicious data. For example, if the java family's features (java-Tojava,javaToandroid,etc.) in the benign data are high values, and also low values in the malicious data, then java will be picked for the mode elements. The idea is to try and mimic the behavior of the benign samples. If only the test data is given, the least popular elements are chosen. The test data includes malicious samples only, as this work analyzed the effect of evasion attacks against the targeted system. For example, if the java family's features (javaTojava,javaToandroid,etc.) in the test data are in low volume, then java will be picked for the mode elements. The idea behind this pick is to try and move the focus of the extracted features to less popular features. These features are supposed to weigh less and therefore be neglected by the targeted system. The difference between the methods of picking the mode elements is a result of the specific evasion attacks, which will be discussed in Section IV-C. L ← L f unc(0, Height) P ← P f unc(0, 1) 6: f ← random(M ode elements) 7: mkdir(f, AP K T ree) 8: Roots ← get roots random(AP K T ree, P, L) 9: for each r ∈ Roots do 10: r new ← f + r 11: S roots ← get Smali f iles(r) 12: for each f ile ∈ S roots do 13: f ile ← change oc(f ile, r, r new) 14: move(r + f ile, r new + f ile) 15: for each f ile ∈ M anif est, Layouts do 16: f ile ← change oc(f ile, r, r new) 17: AP K ← Repackage(M anif est, Smali...) 18: return AP K 19: procedure CHANGE OC(f ile, r, r new) 20: for each line ∈ f ile do 21: Given malicious APKs, the attacker manipulates the structure of each APK towards the picked mode elements. An algorithm that implements this manipulation is depicted in Algo. 1. The inputs for this algorithm include an APK to manipulate, the mode elements, and L f unc and P f unc. The last two inputs define two functions. L f unc is a function that given the range of 0 to the directories tree's height (of the application), chooses a specific height. P f unc is a function that given a range of change(the default is [0,1]), chooses a ratio of change (for the specific level chosen by L f unc). As there are two options for each function depending on the specific attack variant, they are defined as variables. The algorithm implements the following steps: (1) The algorithm's inputs are an APK file, the Mode elements, L f unc and The attacker creates a new full functional APK, as it changes the structure of the Smali code files, the Manifest file, and the layout files (lines [12] [13] [14] [15] [16] . As all of these files might include some occurrences of the part that changed (in other words, references to files that moved from their original places), these occurrences need to be changed accordingly. A small example of this algorithm is provided for clearance. Let's AP K be an APK that includes the following: 1) A Manifest file 2) 2 Layout Files named L1 and L2. 3) A Smali code directory which has the following structure: • A root directory named com. -A subordinate directory with the name tb. * A Smali code file named x1.smali. * A Smali code file named x2.smali. -A subordinate directory with the name xz. Let's mode elements = android, java, L f unc = 1, P f unc = 0.5. For this example, the chosen part for change is the tb folder. The element f is chosen to be android. The structure of the output AP K is now: 3) A Smali code directory which has the following structure: • A root directory named com. -A subordinate directory with the name xz. • A root directory named android. -A subordinate directory with the name tb. * A Smali code file named x1.smali. * A Smali code file named x2.smali. Each occurrence of the changed part (names of tb directory and its subordinate files) is changed in the Manifest files and L1 and L2. The previous section IV-B described the general StB algorithm. This section describes the actual evasion attacks, which are variants of the StB general algorithm. Each variant is described by its attacker model, M ode elements, L f unc and P f unc. The experiments include an analysis of an extended set of models of MaMaDroid [11] . First, the set of distancebased models of MaMaDroid is used (1-NN and 3-NN) and extended by a 5-NN model. Second, the tree-based models are tackled in a parallel way. RF is the only tree-based model that was originally used. As RF is an ensemble of decision trees (DTs), the basic DT model is included. On the other hand, a boosting model (i.e., AdaBoost), which is a more sophisticated ensemble of DTs, is added. In contrast to the RF, where the final decision is based on the decision of each DT independently, in AdaBoost, each DT is aimed to focus on the wrong classification of the prior ones. In total, six models were explored. For each evasion attack, an experiment was conducted using these six models (1-NN, 3-NN, 5-NN, DT, RF, Adaboost). The experiments were run on an Intel(R) Xeon(R) CPU E5-2683 v4 2.1 GHz with 64 GB RAM with GeForce RTX 2080 TI GPU. The dataset for the experiments consisted of ∼73K benign apps from the Google Play Market [1] (obtained from Androzoo [85] ) and ∼6K malicious apps from the Drebin dataset [86] , [4] , [5] , [87] , [88] , [89] , [90] . To account for variations in the dataset, a 5-fold CV was used. To test the changes in the detection rate of each of the models, different ratios of benign and malicious apps were used during the experiments. The test data without any manipulations will be termed clean data. The post-manipulation test data will be referred to as manipulated data. To fully evaluate the evasion attacks, an additional experiment on clean data was used as a baseline. To clarify the effect of the amount of benign data on the decision function of each model is tested, the malicious apps are fixed for each experiment, while the benign ratio changes (10%,20%,. . . ,100%). Evasion Robustness (ER): To evaluate robustness, the portion of malicious instances which was wrongly classified was computed (similar to the analysis provided in [91] ). The TPR of malicious apps was used as this evaluation metric, both for manipulated apps and clean malicious data. Defense Reciprocal Rank (DRR): A new metric to evaluate the strength of an evasion attack was presented in [92] . The intuition behind the Defense Reciprocal Rank (DRR) is that a correct classification is not the only factor of an evasion attack. The confidence of the classifier matters as well. For example, lower confidence may indicate that the attack, though failed to fully deceive the classifier, gained some effect on the classifier and can be more easily improved into a successful attack. Each of the classes (e.g., benign and malicious) is given a rank, based on popularity among the data, target class, and the other class, or any other way of ranking.P i represents the probability assigned to the true class i, and R i is the rank of that class within the ordered predictions list. For the chosen class by the classifier, theP i 's range is between 0-1. For the second-best class,P i 's range is between 0-0.5, as if it was more than 0.5, it was the chosen class. The other classes will follow a similar rule. The ranks in the case of a binary classification are 1 and 2, where 1 is the highest rank. The ranks in this paper will be picked as 1 to be the right class of the sample, and 2 for the wrong class. Each rank will have a range. The ranges should not overlap as well, asP i will be mapped to a R i 's range. The range of the first rank is between 0.5-1. The second rank will be between 0.33-0.5, and so on. The DRR is calculated in Eq. 1, for a sample x: The first element maps the range ofP i to a range in the size of the R i 's range. The second element adds the lower bound of the R i 's range. A sum of both mapsP i to the actual range of R i . Therefore, the overall DRR of a classifier CL on a set of samples X is defined in Eq. 2: Model Reliability Assessment: As the process of a binary classification goes, for each test sample, the model outputs a probability of the target class. The second class holds the complement probability. For each test sample, the entropy of the target class and second class is denoted as s. In other words, s = E(P, P ), where E symbolizes the entropy function, and P ,P are the probabilities of each class produced by the classifier. To fully evaluate the magnitude of each model, the Shannon Entropy of the probabilities is used, to generate a number that represents the reliability of the model, similarly to the approach suggested in [93] . The reliability is based on the complement of the average entropy of a test sample. As entropy usually measures uncertainty and disorder, the reliability is calculated as the complement of the average entropy. That way, a higher score is translated as great confidence by the classifier. The reliability is defined with regards to the classifier CL, the set of test samples X and the set of entropy S for the test set X using Eq. 3: The three metrics create an advanced view of the effects of the evasion attacks. The ER metric depicts the gap in detection rates of malicious samples, between clean and manipulated apps. The DRR metric describes the effect the samples have on the certainty of the prediction function of each detection model. The Model Reliability metric concludes the analysis with the overall reliability of each detection model. As will be exampled in the next section, the following scenario can happen between two models: The first model is more affected by the evasion attacks in means of DRR compared to the second. However, the first model is more reliable in facing evasion attacks. Therefore, the DRR and Model Reliability metrics complete one another by means of a thorough analysis of the effects of the evasion attacks. The first evaluation analyzes the effect of the benign ratio on the ER. The ER metric was computed using the TPR of the malicious apps as the original detection rate was computed. Naturally, as one can expect, as the ratio of benign samples in the train set increases, the classifier will tend to focus on these samples, aiming to minimize the loss function, and the TPR will monotonically decrease. The results of the ER are split into four parts 1 , depicted in Fig. 1 . First, the ER of the clean data is presented as a baseline in Fig. 1a . Second, the ER of the Random StB attacks is described in Fig. 1b . Then, the ER of the Full Statistical StB attack is presented in Fig. 1c . At last, the ER of the Black Hole Statistical StB attack is depicted in Fig. 1d . a) Clean data: The results of the ER of the clean data can be seen in Fig. 1a . It can be seen that the more benign apps used, the fewer malicious apps are detected by each one of the models. The 1-NN and 3-NN models' starting points at 10% of the benign apps (blue and red lines) are lower than the RF (green line), by ∼30%. Their ending point with 100% of the benign apps reaches almost 0%. In other words, the RF's detection rate decreases by ∼ 50%, and the KNNs by ∼ 40%. The DT's (black line) starting point at 10% of the benign apps, is a detection rate of ∼90%. Moreover, this model sustains a stable 1 extended version of these results and the results of the evasion attacks can be found in Github:https://github.com/harelber/The-Black-Holes-of-Markov-Chains. detection rate. The Adaboost model (red line) shows similar detection rates to the RF, 1-NN and 3-NN, with a slight superiority over the RF model. The 5-NN model (cyan line) is less accurate than all models. It seems that the DT is the best model out of the six. b) Random StB attack: The results of the ER evaluation of the manipulated data using the Random StB attack are depicted in Fig. 1b . The distance-based models (1-NN, 3-NN, and 5-NN) show an interesting behavior. They have a similar detection rate along the interval of the benign ratio between the clean data and the manipulated data. In other words, the attack did not have a great effect on them. The RF model suffers from a great loss of ∼40% along the interval. In other words, the attack affected the RF model dramatically. This is a change of course, as in the case of the clean data, the RF was more accurate than the 1-NN and 3-NN. The DT and Adaboost models have a similar distribution as with the clean data. However, the starting and ending points of each model in both of them are lower than the correlative points in the clean data's results. The DT model is stable and has the most effective results against the Random StB attack. However, it has an ER of less than 50% with only 10% of the benign apps. It falls under 40% with 100% of the benign apps. The Adaboost model is now as effective as the 1-NN model. c) Full Statistical StB attack: In this case, which is depicted in Fig. 1c, Fig. 1d . In this case, the distance-based models stay the same as before. The ER of the tree-based models (DT, RF, Adaboost) continue their decrease in ER, and in this case -less than 10% on most of the interval. The distance-based models show poor results with the clean data and against the evasion attacks as well. In comparison, the tree-based models show high ER with clean data. Also, the tree-based models were proven to be more susceptible to attacks than the distance-based models. An explanation for these findings is based on the split nodes function of the tree models. The node split function is effective in means of identification of malicious activity, without any manipulation. The more features it processes, the more accurate it becomes. This is true for the DT model, as the amount of benign data did not dramatically affect its detection rate. However, splitting the data between multiple weaker modules, like exampled in the Adaboost and RF models, results in a decrease in the ER. Also, the tree-based models are based on features that are often manipulated by the attack. Therefore, the ER changes according to the attacks. However, as most of the features stay similar to the clean data, the distance-based models show a similar detection rate between the baseline and the evasion attacks. The second evaluation is of the DRR. The DRR measures how effective an evasion attack is. A higher DRR means a stronger classifier. In addition to the ER, this metric reflects the gap between the predictions on clean data and the predictions on evasion attacks. The ranks chosen for this experiment were 1 for malicious and 2 for benign, as the test data consisted of only malicious data. Therefore, the rank of 1 depicts the correct label, and a rank of 2 for a mistake. In this section, a review of the results of the models tested on clean data, and the three evasion attacks is presented. The clean data results are depicted in Fig. 2a . The DRR of the evasion attacks are presented in Fig. 2b-2d. 1) Clean data: The results of the DRR evaluation of the clean data are depicted in Fig. 2a . As with the case of the ER, the tree-based models overcome the distance-based models, considering data without any evasion attack. Also, the DT model holds the highest DRR, which depicts the most resilient model. Its DRR is more than 90% along the interval. 2) Random StB attack: In Fig. 2b , the DRR evaluation of the manipulated data using the Random StB attack is depicted. The DRR of the distance-based models stays the same as in the clean data. Fig. 2d . The distance-based models, the DT and Adaboost models present identical DRR to the Full Statistical StB attack. The starting point of the RF is lower than its correlative starting point in the Full Statistical StB attack. The distance-based models showed similar DRRs in each case explored, as in the ER metric. The evasion attacks had a similar effect on the DRR of the Adaboost model, in contrast to the DT model, which decreased by 50% in the case of the Random StB attack, and an additional 20% in the Full and Black Hole Statistical attacks. These findings raise the need to investigate if the Adaboost model is more effective than the DT model and if the Adaboost model is more reliable. The next section on the Model Reliability metric will answer these inquiries. The reliability of each model was inspected, to understand the confidence of each model on the predictions. High reliability means that the gap between the two classes is great. A low value means that the classes are close to being indistinguishable. This does not influence the ER, as a classifier with a low detection rate can be "confident" in its predictions. Also, a "lucky" classifier can have a high detection rate, but the probabilities of each class may be close. The assessment was done on four cases -clean data and the manipulated data by the three evasion attacks -to compare the changes the attacks have on the classifiers. The assessment is depicted in Fig. 3 . As the reliability assessment results are similar between the four cases -clean data and manipulated data -they will be discussed without specific references to each one of the subfigures. It can be seen that the 1NN model is reliable, as its reliability is 1 on the whole interval, for each one of the cases. The reliability of the DT model is also found to be very high, as its reliability receives the value of 1 along whole the interval for each one of the cases. The reliability of the Adaboost model was found to be a constant value of 0.3 along the interval, for each one of the cases. The Model reliability rates of the other models -3NN, 5NN, and RF are increasing throughout the interval. The increasing reliability of the 3NN and 5NN models is similar in the four cases. The RF model is found to be more reliable in the cases of evasion attacks than the clean data. In other words, it is more "confident" in its labels of the evasion attacks than the clean data. Both the 1NN and DT models show a constant high value of model reliability, with a value of 1. In contrast, the Adaboost model has a constant value of 0.3. As these phenomena were surprising, a close examination of the probabilities of each class for these models was done. It was found that for the 1NN and DT models, the probabilities are binary for each sample, without any regard to the benign ratio of the apps. For the Adaboost model, both probabilities on each sample were close to 0.5. Therefore, the entropy is high in the results of the Adaboost model, and low in the results of the 1NN and DT models, which outcomes in correlative complement Model Reliability rates. Therefore, it was found that although the Adaboost model had better DRR than the DT model against the evasion attacks, its reliability is far worse. The DT model was found as the most promising model out of the six, based on the combined insights of the ER, DRR, and Model Reliability metrics. Four important insights are derived from the combination of the results of the ER, DRR, and Model Reliability: 1) The most promising model tested on the clean data and the evasion attacks is DT. However, this model is susceptible to evasion attacks. 2) The distance-based models were not much affected by the evasion attacks. However, their detection rate is low in the baseline and any evasion attack. 3) The ER and DRR results proved that the Random StB attack is less effective than the Full Statistical and the Black Hole Statistical attacks. For example, for the DT model, the DRR of the clean data was 80%-90%. On the Random StB attack, the DRR was 50%-60%. However, the DRR decreased to 40% and less in the cases of the Statistical StB attack and Black Hole Statistical attack. Also, the ER supports this conclusion. This insight is pretty intuitive as a random attack is not specifically targeting the data the model is built upon. In comparison, the Full Statistical StB attack requires the train data to generate the attack. However, it was important to establish this insight by viewing the results. In addition, the Black Hole Statistical StB attack got similar results to the Full Statistical StB attack but needs less information -only the test data. Therefore, it can be considered the best attack out of the three. 4) A lower DRR does not necessarily means a less effective model. As proven, the DT model had a lower DRR in the case of the Full and Black Hole Statistical StB attacks than the Adaboost model, which had a similar DRR for each attack. However, the ER and Model Reliability showed the full picture, in which the Adaboost model is less reliable and robust than the DT model. A more robust version of MaMaDroid can be suggested, due to the effect of the ratio of benign apps on the detection of malicious apps and evasion attacks. This version should use the most effective model that was found -DT. However, using the existing DT model is not enough, as it is susceptible to StB attacks. The next section will demonstrate a suitable solution for the holes in the model. As suggested in the previous section, DT was found to be the most effective and reliable model for MaMaDroid. However, the DT model was found to be vulnerable to StB attacks. The distance-based models were found to be robust against the evasion attacks, but their original detection rates were low. A stronger version of MaMaDroid can be suggested due to these findings. As even the strongest model, DT was evaded by the StB attacks, there is still room for improvement. An extension of the feature set might help to mitigate the attacks, thus creating a hybrid approach. One of the influential works on Android malware detection is Drebin [4] . This work laid the foundations for ML Android malware detection which is based on static analysis. This system is based on eight feature sets. For the extension of MaMadroid, one of the feature sets was picked -the required permissions set. Permissions are a great candidate for a feature set of Android malware detection systems over the years, as a full feature set of a detection machine or a part of it (i.e. [94] , [95] , [76] , [10] , [96] , [52] , [97] , [16] ). Therefore, the required permissions set was picked to enhance MaMaDroid. It will be termed as permission set. An analysis of the permission set of the dataset (the same dataset as Section V) resulted in several insights. First, using the full permission set from the dataset will increase the size of the feature set. Second, the less frequent permissions will not affect the detection process as their weight will be low. Moreover, an analysis of the rare permissions found that they were custom permissions [98] used on a few apps. Therefore, only the frequently requested permissions in the dataset were included in the permission set, in over 10% of the data, both benign and malicious 2 . MaMaDroid 2.0, the enhanced version of MaMaDroid, follows the following steps: 1) Run MaMaDroid1.0's feature extraction on the app to get MaMaDroid1.0's feature set. 2) Extract the permission set from each app by exploring its Manifest file. Filter the less frequently requested permissions. 3) Merge the two feature sets. 4) Run the model on the merged feature set and get a classification of the app. The following sections include the experimental design of MaMaDroid2.0 (Section VI-A), and the results (Section VI-B) Several experiments were done to assess the power of MaMaDroid2.0. The dataset for these sets of experiments consisted of the dataset from Section V, using 100% of the benign apps. As will be seen in the results, using 100% of the benign apps does not damage the effectiveness of MaMaDroid2.0, concerning the classification of benign and malicious apps. This set of experiments tested the effectiveness of the new detection machine, both on clean data and manipulated data. It was done to see if the addition of features damaged the correct classification of benign or malicious data. In each experiment, the DT model was tested, which was proven to be the best model in Section V. Each test case consisted of benign and malicious data, where the malicious data changed according to the specific case. The model was tested in terms of f1-score and recall, to test its correct detection as a whole, and specifically the malicious samples. As most of the test set is benign, it is pretty easy to detect benign apps as benign, using the trivial labeling of the whole test data as benign. Therefore, the recall metric was used, to identify the correct classification of malicious samples. The F1 metric was used to see the effect of the precision and recall together, so as to see the false positives rate on the new machine along with the recall. Furthermore, as MaMaDroid2.0 incorporates the permission set as part of its feature set (as used in Drebin [4] ), attacks against Drebin [5] were tested as well, to see if the new machine is strong against these attacks as well. Each experiment is described by the test data and feature set, as the train data stays the same. The settings are described in Table II in Section VI-B. The results of the first three experiments of MaMaDroid2.0 are presented in Table III . It can be seen that the recall of the baseline of DT, as was seen in the results of the former experiments (Section V), is 0.84 using clean data. Additionally, the recall of the StB attacks is lower than 0.3. However, adding the permission set thus using the enhanced feature set, raises the recall of each case to 0.9 or more. The f1-score of each case using this extended feature set is more than 0.96. In comparison, in the baseline, the f1-score is 0.89 with the clean data and less than 0.66 with the manipulated data. Using only the permission set achieves a close recall and f1-score rates to the enhanced feature set of MaMaDroid2.0. These results may raise the question: What is the importance of the MaMaDroid1.0 feature set if the permission set alone can be used as a standalone to identify evasion attacks? The answer lies in Table IV . The MB attacks show a great impact on a model that is based on the permission set. The highest recall using the permission set only is against MB1, with a value of 0.49. The other MB attacks achieve recall values of 0. In comparison, the enhanced feature set achieves high recall values, at least 0.97, for each case -clean data and manipulated data. This proves that a model that analyzes only the permission set cannot identify evasion attacks that manipulate permissions. In contrast, MaMaDroid2.0 succeeds in this task. In conclusion, it was proven that the enhanced feature set of MaMaDroid2.0 creates a stronger model. This model was proved to identify attacks against both MaMaDroid and Drebin alike. The original MaMaDroid was based on a specific ratio between benign and malicious applications. The original ratios were between 50%-50%, or with a tendency toward the malicious class. As a recent study suggested [26] , the realistic ratio in the world between benign and malicious apps is leaning toward the benign class. More specifically, it is 90%-10% between benign and malicious apps. As a consequence, the number of benign apps was tested in this research to derive their influence on the ER. As was seen, the ratio did have an impact on each one of the models that were tested. In addition, the StB evasion attacks decreased the tree-based models' ERs. The distance-based models proved to be strong against the attacks, but their detection rates were already low on the clean data (and stayed there on the manipulated data by attacks). The DT model was found to be the best model by means of ER. It was also found to be very reliable, even in the most challenging ratio of more than 90% of the dataset being benign apps. For all of the above, it was suggested to replace the former models of MaMaDroid with a DT. In addition, the merged feature set upgraded the total effectiveness of the model. Adding the enumeration of permission requests, which are a small amount of data, aids the identification of malicious activities. Other additions to the feature set may help in this task, such as suspicious API calls or intents. These directions are left as future work. Most mitigation techniques to evasion attacks try to find the trail, the attacks leave and identify them. With other attacks, the effect may be eliminated by preprocessing. For example, adding no-operation calls to CFG based detection machines, like in [51] . In regards to the StB attacks, the flow of the attack does not add actual code to the app, just a replacement of references to classes names, and moving parts of the application between places. Moreover, in the Full/Black Hole Statistical StB attacks, there is only an addition of a new root directory to the app. A mitigation technique that might succeed in some of the cases is looking for duplication of a name of a directory. For example, if the former root directory of the app is "android" and after the evasion attack now the root directory is also "android", there will be two android directories, one below the other. A quick check for such a case will result in an alert for the StB case. However, not every app's root directory is similar to the new root directory which is picked by the attacker. Therefore, the StB attacks are still a threat to malware detection machines that are based solely on CFG. MaMaDroid is an excellent example of using the CFG to identify behaviors of benign and malicious APKs. In general, a CFG of a software tries to organize the order of commands the software runs. A malware detection machine that is based on a CFG tries to find the differences between the order of commands in malicious and benign apps to address the task of the classification of new samples. As the experiments in this study showed, a shift in the order of commands can cause a miss-classification. For example, the Black Hole StB variant demonstrates the effect of moving some of the commands from the malicious order of commands to an unknown order of commands. In other words, the order turns to chaos. The different variants modify the order of commands in an app, which results in the alteration of its CFG. When the new CFG is inspected by the detection machine, it is puzzled. This course of action can be generalized to other domains as well. Therefore, the CFG-based detection machines are now faced with the following challenge: Are they secure enough against the ominous chaos that will arrive? It seems that an extension of the feature set or a hybrid approach may aid in dealing with this issue. The combination of different features seems a promising step towards more accurate solutions against multiple types of malware that are exploiting different vulnerabilities. This work was supported by the Ariel Cyber Innovation Center in conjunction with the Israel National Cyber Directorate of the Prime Minister's Office. The authors will like to thank Dr. Lucky Onwuzurike for his technical aid with MaMaDroid. GooglePlay app market. GooglePlay website A year of lockdown sees a surge in mobile malware targeting banking, billing and covid-19 vaccines Droidapiminer: Mining apilevel features for robust malware detection in android Drebin: Effective and explainable detection of android malware in your pocket Crystal ball: From innovative attacks to attack effectiveness classifier Towards sustainable android malware detection Stormdroid: A streaminglized machine learning-based system for detecting android malware Madam: a multi-level anomaly detector for android malware A new adaptive learning algorithm and its application to online malware detection On lightweight mobile phone application certification Mamadroid: Detecting android malware by building markov chains of behavioral models andromaly": a behavioral malware detection framework for android devices Detection of malicious code by applying machine learning classifiers on static features: A state-of-the-art survey Mobile malware detection through analysis of deviations in application network behavior Sigpid: significant permission identification for android malware detection Apk auditor: Permission-based android malware detection system A heuristic approach for detection of obfuscated malware Efficient signature based malware detection on mobile devices Exploring permission-induced risk in android applications for malicious application detection Droidmat: Android malware detection through manifest and api calls tracing Permlyzer: Analyzing permission usage in android applications Two years of short urls internet measurement: security threats and countermeasures Exploring the long tail of (malicious) software downloads Android security 2017 year in review Andradar: fast discovery of android applications in alternative markets Eliminating experimental bias in malware classification across space and time Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples Adversarial examples for malware detection Black box attacks on deep anomaly detectors Adversarial examples: Attacks and defenses for deep learning Control flow analysis Generating natural language adversarial examples Evading botnet detectors based on flows and random forest with adversarial samples Evading classifiers by morphing in the dark Yes, machine learning can be more secure! a case study on android malware detection Textbugger: Generating adversarial text against real-world applications Droidchameleon: evaluating android anti-malware against transformation attacks Adam: an automatic and extensible platform to stress test android anti-virus systems Evasion attacks against machine learning at test time Towards evaluating the robustness of neural networks Droideye: Fortifying security of learning-based classifier against adversarial android malware attacks Adversarial attacks on mobile malware detection Multiadversarial discriminative deep domain generalization for face presentation attack detection Towards feature space adversarial attack Formalization of the feature space for detection of attacks on wireless sensor networks Automatic generation of mobile malwares using genetic programming An adversarial machine learning model against android malware evasion attacks Generating adversarial malware examples for black-box attacks based on gan Replacement attacks: automatically impeding behaviorbased malware specifications Unsupervised adversarial attacks on deep feature-based retrieval with gan Android hiv: A study of repackaging malware for evading machine-learning detection Intriguing properties of adversarial ml attacks in the problem space Repackman: A tool for automatic repackaging of android apps Evasion is not enough: A case study of android malware Android malware detection based on factorization machine Adversarial deep ensemble: Evasion attacks and defenses for malware detection Stealth attacks: An extended insight into the obfuscation effects on android malware Mystique: Evolving android malware for auditing anti-malware tools Reliable third-party library detection in android and its security applications Cdgdroid: Android malware detection based on deep learning using cfg and dfg Android malware family classification and characterization using cfg and dfg Dadidroid: An obfuscation resilient tool for detecting android malware via weighted directed call graph modelling Server-based code obfuscation scheme for apk tamper detection Nativeguard: Protecting android applications from third-party native libraries Intrusion detection for mobile devices using the knowledge-based, temporal abstraction method Madam: Effective and efficient behavior-based android malware detection and prevention Android malware detection through machine learning on kernel task structures Android malware classification method: Dalvik bytecode frequency analysis Tinydroid: a lightweight and efficient model for android malware detection and classification Gauging similarity with n-grams: Languageindependent categorization of text Appspear: Bytecode decrypting and dex reassembling for packed android malware Android malware detection through hybrid features fusion and ensemble classifiers: The andropytool framework and the omnidroid dataset Reevaluating android permission gaps with static and dynamic analysis Efficient and comprehensive mobile app classification through static and dynamic analysis A hybrid analysis-based approach to android malware family classification Generic black-box end-to-end attack against state of the art api call based malware classifiers On the feasibility of adversarial sample creation using the android system api Mamadroid: Detecting android malware by building markov chains of behavioral models (extended version) Handbook of markov chain monte carlo Practical markov chain monte carlo. Statistical science Markov chain monte carlo without likelihoods On the use of dgas in malware: an everlasting competition of detection and evasion Androzoo: Collecting millions of android apps for the research community Aspectdroid: Android app analysis system Android malware detection via an app similarity graph R-packdroid: Api package-based characterization and detection of mobile ransomware Byte-level malware classification based on markov images and deep learning Android malware detection based on network traffic using decision tree algorithm Improving robustness of ${$ml$}$ classifiers against realizable evasion attacks using conserved features Evaluation of neural networks defenses and attacks using ndcg and reciprocal rank metrics Ensemble selection based on classifier prediction confidence Permpair: Android malware detection using permission pairs Droid permission miner: Mining prominent permissions for android malware analysis Significant permission identification for machine-learning-based android malware detection Puma: Permission usage to detect malware in android This figure "cfg.png" is available in "png