key: cord-0057676-tp9pmr6f
authors: Rami, Khyati; Desai, Vinod
title: Malware Detection Framework Using PCA Based ANN
date: 2020-06-08
journal: Computing Science, Communication and Security
DOI: 10.1007/978-981-15-6648-6_24
sha: 1dd34e949615be1bfde0496dc3808e3daad0796f
doc_id: 57676
cord_uid: tp9pmr6f
Many kinds of threats can damage a computer system, and malicious programs are one of them. The Internet is a main channel through which such threats spread. Experts continuously work to detect threats that can slow a system down or damage it entirely, yet malware creators have always been a step ahead. There are two basic approaches to detecting malware: signature-based and heuristic-based. Heuristic detection techniques aim at accurate and efficient results. Because polymorphic malware is growing day by day, heuristic methods are combined with machine learning to obtain more precise and effective detection. Many researchers have proposed malware detection systems that use data mining and machine learning methods to detect known and unknown malware. In this paper we present the ideas behind our malware detection framework, which uses a PCA-based ANN to detect known and unknown malware. The proposed framework is designed with a MATLAB GUI. The ANN is used to detect the presence of malware in the CSDMC2019 API dataset. The computational time of the ANN classifier is less than 0.2 s, compared with 0.82 s for the NB classifier.
Computer threats are created to corrupt confidential information and to harass users in malicious ways, and malware is one of these threats. Malware is increasing at an alarming rate, and the number of security incidents is growing as a result [1, 2]. Malware propagates like a chain reaction, which is dangerous because there is no centralized control, so it is not easy to detect. According to studies, malware is a crucial threat to computer security [3].
Malware authors try to create programs that cannot be traced easily, and from time to time they change their techniques so that malicious code can be transformed without being detected. These ideas started with encryption and progressed to oligomorphic, polymorphic and metamorphic viruses. Studies have found existing techniques to be limited, so combinations of artificial intelligence, machine learning and data mining methods have been used to increase the efficiency of malware detection [4]. Signature-based detection methods are efficient at detecting known malware but cannot detect unknown malware or polymorphic malware, whose signature changes. Heuristic-based detection methods can trace both known and unknown malware, but they can produce high rates of false positives and false negatives, so more accurate detection methods are required. Because of the alarming growth of polymorphic malware, heuristic-based detection techniques are combined with machine learning methods to obtain accurate and efficient detection results. The current situation therefore calls for better solutions.
The paper is organized as follows: Sect. 1 gives the introduction, Sect. 2 describes the literature review, Sect. 3 describes the proposed malware detection system, Sect. 4 describes the results, and Sect. 5 gives the conclusion. In this research work an ANN is used to detect the presence of malware in the CSDMC2019 API dataset. We tested the proposed malware detection system by connecting it to a mobile OS and transferring a file from the mobile OS to a desktop OS. We used a MATLAB GUI to design the proposed malware detection system.
Mariantonietta La Polla [5] surveys the threats, vulnerabilities and security solutions of more than a decade, specifically the period 2004-2014, focusing on high-level attacks, that is, attacks on user applications.
Existing approaches for protecting mobile devices against different classes of attacks can be grouped into categories based on their detection principles, architectures, collected data and OS, with the main focus on IDS-based models and tools. This categorization aims to provide a clear and concise view of the underlying model adopted by each approach. Sujithra M. [6] focused on the threats and vulnerabilities that affect mobile devices and discussed how biometrics could be a solution for securing them. Such systems are presented as highly confidential, portable, mobile-based security systems, which are very much required. The work compares biometric features such as fingerprint, face, gait, iris, signature and voice, and finds iris to be the most effective biometric feature owing to its reliability and accuracy. We have also reviewed research papers on detecting known and unknown malware; Table 1 compares the studied papers.
In the sections above, we have presented a brief review of the malware detection and prevention techniques introduced in past decades. Day by day, malware writers improve and evolve their camouflage techniques, from simple encrypted viruses to extremely complex and difficult-to-detect polymorphic and metamorphic viruses. Based on the literature review, we have designed a malware detection model for known and unknown malware.
A smart host-based system was developed and evaluated for detecting malware on mobile devices. The framework is designed to be light on the system so that it consumes minimal CPU, memory and battery. It continuously samples various features on the device and collects data, which are then analyzed using machine learning and temporal reasoning methods to assess the state of the device. The features of the framework are divided into two categories, Application Framework and Linux Kernel.
Features such as Messaging, Phone Calls and Applications belong to the former, whereas Keyboard, Touch Screen, Scheduling and Memory belong to the latter. This framework helps to detect malware and to find weak points in the mobile OS. KBTA (Knowledge-Based Temporal Abstraction) is used to represent malware behavior in the mobile OS. The behavior pattern is classified using a classifier. Data passing through the system are scanned by an anomaly detector for incoming anomalies. Rule-based processes use preset parameters to divide the inputs. These four processes are used to counter malware intrusion.
We now present the workflow of our proposed ANN-based malware detection framework, shown in Fig. 1. The entities in Fig. 1 are explained below.
A Graphical User Interface (GUI) allows users to interact with electronic devices through graphical icons and visual indicators such as secondary notation, instead of a Command Line Interface (CLI), which requires the user to type commands or use text navigation. CLIs were not very user friendly because even simple operations required extensive typing, hence GUIs were introduced. The GUI used in this framework was designed using MATLAB.
A processor is a logic chip that responds to and processes the basic instructions initiated by the computer. The four basic functions of a processor are fetching, decoding, executing and writing back. This unit is used to extract the useful and required information from the layers. It also contains the necessary hardware and software units.
The alert manager handles alerts sent by client applications. It takes care of deduplicating, grouping and routing them to the correct receiver integration, such as email, PagerDuty or OpsGenie. It also takes care of silencing and inhibiting alerts.
It is used to collect the results from all active processors and applies an ensemble algorithm to derive a final decision on the device's infection level.
This interface allows the user to edit feature lists, which can be assigned to packages that are applied to cPanel accounts. Feature lists provide or prevent access to specific cPanel features.
SQL is used to perform operations on the records stored in the database, such as updating records, deleting records, and creating and modifying tables and views.
We would like to stress that the workflow is generic and can be used for both permission-based detection and system call-based detection. In the offline training phase, we first collect real benign and malicious applications. The collected applications are then executed and their data sources are dumped. Using the mapped data as input, we then train the neural network. In the online detection phase, we dump the data sources from new applications, and the trained neural network is used to decide whether a new application is malware or benign. Because permissions and system calls contain different features and have different formats, we first present permission-based detection and then system call-based detection in the following subsections.
Step 1: Data collection and classification. The first step in the offline training phase is to gather data from running applications. Given benign applications and malware samples, applications of the same class should yield similar data, so that the data can be used to build an anomaly profile. Using these profiles we can classify applications as benign or malicious.
Step 2: Dumping data source permissions. For the benign applications and the malware samples, the permissions requested by each application are dumped.
All permissions in the Android framework are declared in the AndroidManifest.xml file. To process the apk files, a tool called Android Asset Packaging Tool (aapt) is used. It helps to recover the source code and to obtain the AndroidManifest.xml permissions for all applications.
Step 3: Feature extraction. A set of files containing the permissions requested by each application is collected. For training, the data are processed and mapped to the input format required by the ANN. A mapping algorithm was designed to convert the original permissions into machine-readable input. In this algorithm an integer is assigned to each feature, and the feature's value indicates whether the application requested it. An application can request a given permission only once. If a permission is requested, its value is 1; otherwise it is 0. Because the ANN accepts integers as input, we map each permission name to an integer to build the permission list. The mapping produces outputs such as "01, 02, 03, 06, 09, 15, and 20". For example, BLUETOOTH is mapped to 12, READ CALL LOG is mapped to 14, and READ CONTACTS is mapped to 8. The scheme can be extended to 2-grams, which use two consecutive permissions as a detection feature instead of one. For instance, we join two consecutive integers, and the mapping results are "0102, 0203, 0304, and 0405", where "0102" represents the permissions ACCESS NETWORK STATE and GET ACCOUNTS in that order. After the integer sequence is generated, the next step is to obtain the value of each feature. Note that the presence of a permission is treated as the feature value. Each feature that appears has the value 1, and each feature that does not appear has the value 0.
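The permission mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table `PERMISSION_IDS` contains just the five IDs quoted or implied in the text, the function names are hypothetical, and sorting the codes in ascending order is an assumption based on the example output, not something the paper specifies.

```python
# Illustrative permission-to-integer table. The five IDs below are the ones
# quoted or implied in the text; a real table would cover every permission.
PERMISSION_IDS = {
    "ACCESS_NETWORK_STATE": 1,
    "GET_ACCOUNTS": 2,
    "READ_CONTACTS": 8,
    "BLUETOOTH": 12,
    "READ_CALL_LOG": 14,
}


def encode_permissions(requested):
    """Return the sorted two-digit codes of the known requested permissions."""
    ids = sorted({PERMISSION_IDS[p] for p in requested if p in PERMISSION_IDS})
    return [f"{i:02d}" for i in ids]


def two_grams(codes):
    """Join each pair of consecutive codes, e.g. ["01", "02"] -> ["0102"]."""
    return [a + b for a, b in zip(codes, codes[1:])]


def presence_vector(requested, size):
    """Binary vector: position i-1 is 1 if the permission with ID i was requested."""
    vector = [0] * size
    for name in requested:
        idx = PERMISSION_IDS.get(name)
        if idx is not None and 1 <= idx <= size:
            vector[idx - 1] = 1
    return vector


# Hypothetical application requesting three permissions.
app = ["GET_ACCOUNTS", "ACCESS_NETWORK_STATE", "BLUETOOTH"]
codes = encode_permissions(app)
print(codes)
print(two_grams(codes))
print(presence_vector(app, 14))
```

Permissions that are missing from the table are silently ignored here; a production mapping would more likely reserve an explicit "unknown" slot.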
After the last two for loops, we obtain the feature vector that serves as the input of the ANN, as follows:
1,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,1,0
Step 4: Classifier learning. In this step the learning module builds the neural network that analyzes application behavior from the training data. The feature vectors are input to the MATLAB Neural Network Toolbox in MATLAB R2016a (8.1.0.604) to perform permission-based detection. The number of nodes in the hidden layer is set to 10 and then to 20.
The online detection phase is similar to the offline training phase described above. To classify an application, the first step is to dump its permissions and map the permission list to the format required by the ANN. The trained ANN is then used to decide whether the application is malware or benign. The trained ANN and the test data serve as the inputs when classifying new applications. The training file has a slightly different format from the test file, which holds the feature vectors of all applications. The detection process produces a result file containing the classification result, which is +1 or −1 in this framework. A positive result classifies the application as benign, whereas a negative result classifies it as malicious.
Pseudocode for permission-based detection:
Step 1: Begin
Step 2: Gather the data from the executing application
Step 3: Dump the permissions requested by every application
Step 4: Use the Android Asset Packaging Tool (aapt) to recover the source code and obtain the AndroidManifest.xml permissions for every application
Step 5: Feature extraction: collect the set of files with permissions
Step 6: Process and map the data to the format required by the ANN
Step 7: Assign an integer to every feature
Step 8: If a permission is requested, set its feature value to 1, otherwise to 0
Step 9: Send the input vector to the MATLAB Neural Network Toolbox to perform permission-based detection
Step 10: If the output is +1, report no malware; if the output is -1, report malware
The system call-based detection system follows a procedure comparable to that of the permission-based system. The main difference is that it uses a different data source. The following steps briefly introduce how system call-based detection works.
Step 1: Data set collection and classification. The first step is to gather the data set. The data set contains real benign applications and malware samples, which we separate into different groups.
Step 2: System call recording. A well-known tool, trace, is employed to record the system calls requested by the benign applications and the malware samples. Nexus Root Toolkit v1.6.2 is used to obtain root permission on Android devices in order to install trace. The tool is then run, and the system calls made by both the benign and the malware applications are recorded. Android Debug Bridge (ADB) is used to install the malware on an Android device from a remote computer.
Step 3: Feature extraction. Each executed application generates a file containing its system calls, and all such files are collected into a set. The data must be processed and mapped to the input format required by the ANN. Each system call is mapped to an integer. For example, getpid (get current process ID) is mapped to 1, readfile is mapped to 3, and readconsole is mapped to 6. We can also use 2-grams, taking two consecutive system calls as a detection feature instead of one. To build the 2-grams, every pair of consecutive integers is combined, producing a result such as "0101 0103 0306 0601 0116 1616 1616 1616 1608", where "0103" denotes the system calls getpid and readfile being executed in that order.
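The 2-gram encoding of a system call trace can be illustrated with a short Python sketch that reproduces the example string above. The IDs for getpid, readfile and readconsole come from the text; the names `call_08` and `call_16` are hypothetical placeholders, because the text does not say which system calls map to 8 and 16, and the function name is likewise an assumption.

```python
# Integer IDs for system calls. getpid, readfile and readconsole use the IDs
# quoted in the text; "call_08" and "call_16" are hypothetical placeholders
# for the unnamed calls that appear as 08 and 16 in the example output.
SYSCALL_IDS = {
    "getpid": 1,
    "readfile": 3,
    "readconsole": 6,
    "call_08": 8,
    "call_16": 16,
}


def syscall_two_grams(trace):
    """Encode a trace as 2-grams of two-digit IDs, preserving call order.

    Unlike a presence vector, repeated pairs are kept, because the order and
    repetition of calls carry information. Unknown names raise KeyError.
    """
    codes = [f"{SYSCALL_IDS[name]:02d}" for name in trace]
    return [a + b for a, b in zip(codes, codes[1:])]


# A trace chosen so that the output matches the example given in the text.
trace = [
    "getpid", "getpid", "readfile", "readconsole", "getpid",
    "call_16", "call_16", "call_16", "call_16", "call_08",
]
print(" ".join(syscall_two_grams(trace)))
```

A trace of n calls always yields n - 1 overlapping 2-grams, so the ten-call trace above produces the nine codes of the example.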
The density of each system call is calculated as the ratio of the number of occurrences of that system call to the total number of system calls generated. This gives each feature and its value.
Step 4: Classifier learning. Step 4 of system call-based detection is the same as Step 4 of permission-based detection. After this step, training of the ANN is complete and it is ready to be used for online malware detection.
The online detection phase follows the same procedure as the offline training phase. To classify an application, we execute it, dump its system calls and map the system call sequence to the format required by the ANN. Using the ANN trained in the offline phase, we can determine whether a new application is malicious or benign.
Pseudocode for system call-based detection:
Step 1: Begin
Step 2: Collect the data set
Step 3: Obtain root permission and record the system calls
Step 4: Run trace and capture the system calls used by the benign and malicious applications
Step 5: Use Android Debug Bridge (ADB) to install the malware
Step 6: Feature extraction: record a set of files
Step 7: Process and map the data to the format required by the ANN
Step 8: Assign an integer to every feature
Step 9: Set each feature value to the density of the system call (its count divided by the total number of system calls)
Step 10: Send the input vector to the MATLAB Neural Network Toolbox to perform system call-based detection
Step 11: If the output is +1, report no malware; if the output is -1, report malware
The experimental results and performance evaluation of malware detection include simulated results. The evaluation comprises two experiments: one uses the public MalGenome dataset and the other is based on a private dataset. The MalGenome experiment used k-fold cross-validation with k = 10, also known as tenfold cross-validation. K-fold cross-validation applies the holdout scheme and repeats it K times.
In each repetition the dataset is split into two parts: one subset serves as the testing set and the remaining K-1 subsets serve as the training set. Finally, the average over all K trials is calculated to obtain the evaluation result. In the experiment on the newest malware, both a training set and a test set were used. The training set is the data used to discover a potentially predictive relationship, whereas the test set plays the principal role in assessing how well the classifier performs. The test set is of no value unless it is kept out of the training data. Finally, the best classifier was determined from the experimental results of both settings.
Our data set consists of 1449 apps in total. We collected 1008 top free apps across different categories from Google Play to create a benign set. Our malware set consists of 441 apps taken from the Android Malware Genome Project. We used ApkTotal to make sure that our benign set is free from any malware.
In this research work an ANN is used to detect the presence of malware in the CSDMC2019 API dataset. This dataset is composed of a selection of Windows API/system call trace files, intended for testing classifiers that handle sequences. We used a MATLAB GUI to design the proposed malware detection system, which is shown in Fig. 2. The GUI design is flexible and accommodates all three stages of the proposed system: malware creation, detection and prevention. We also tested the proposed malware detection system by connecting it to a mobile OS and transferring a file from the mobile OS to a desktop OS. Once the mobile device is connected, it requests access, which is granted if the verification process completes successfully. Features are extracted from the OS, and PCA is used to reduce the dimensionality of the extracted features. The reduced features are then compared with the malware features learned by the ANN from the training data.
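The PCA reduction step can be illustrated with a small, dependency-free Python sketch. The paper does not state how PCA was implemented or how many components were kept, so everything here is an assumption: the function names are hypothetical, the principal directions are found by power iteration with deflation rather than by a library routine, and the four sample rows are made-up binary feature vectors.

```python
import math


def matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def top_eigenpair(cov, iters=500):
    """Dominant eigenvector and eigenvalue of a covariance matrix (power iteration)."""
    d = len(cov)
    v = [float(i + 1) for i in range(d)]  # fixed, non-uniform start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = matvec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left to explain
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, matvec(cov, v)))
    return v, eigenvalue


def pca(rows, k):
    """Project rows onto their first k principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    components = []
    for _ in range(k):
        v, lam = top_eigenpair(cov)
        components.append(v)
        # Deflation: remove the variance already explained by this component.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    projected = [[sum(a * b for a, b in zip(r, c)) for c in components]
                 for r in centered]
    return projected, components


# Four made-up binary feature vectors reduced from 4 to 2 dimensions.
features = [[1, 0, 1, 0], [1, 0, 0, 0], [0, 1, 1, 1], [0, 1, 0, 1]]
reduced, _ = pca(features, 2)
for row in reduced:
    print([round(x, 3) for x in row])
```

Power iteration is chosen only to keep the sketch self-contained; it can converge slowly when two eigenvalues are close, and a fixed start vector can in rare cases be orthogonal to the dominant direction, so a library eigensolver or SVD is the usual choice in practice. The reduced rows would then be passed to the classifier in place of the original feature vectors.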
If any similarities are found between the features, the system reports the presence of malware created while the data were being transferred. Figure 3 shows the structure of the ANN during training for malware detection, after the downloaded database has been loaded and feature extraction and PCA-based feature reduction have completed. The OS customization first scans the data on the desktop for malware. Once desktop OS customization completes, file transfer from the mobile device is initiated. As an initial step, a request window appears for connecting the mobile device. When the system gains access to the data on the mobile device, the data are correlated, and the proposed system loads and scans them for further processing (Figs. 4 and 5). Once the mobile OS is customized, the user ID is verified to initiate file transfer from the mobile device to the desktop system. When the verification process completes successfully, the process starts accessing the files on the mobile device. Once the files are accessed, the files from the appropriate device are tracked and transferred via the API. Figure 6 shows the screenshot captured while selecting the files to be transferred and processed. The verification process takes place again to improve system security. Figure 7 shows the screenshot captured after the file transfer has completed and the files have been fully accessed. After the file transfer completes, malware detection is initiated again. It checks for malware introduced during the file transfer (Fig. 8). When malware is detected, an email notification is sent to the authenticated email address using MATLAB. The email also contains the label of the detected malware and a notice of the corrupted file (Figs. 9 and 10).
The trivial hash-based deadlock algorithm protects the overall system and its files from harmful malware that may be created during file transfer (Fig. 11). The process from user verification to OS customization is repeated to prevent malware from entering any system. To prevent malware using the trivial hash deadlock, the malware detection process is initiated, and once any suspicious activity is detected, the internal and external operations of the system are blocked and no one can access any files on the system. A notice to reboot the system, together with the malware notification, is then sent to the user's email, and the system can be operated only after it has been rebooted. Figure 12 shows the notification that operations are blocked when malware enters the system. The system is then prompted to reboot after malware is detected, to prevent further harm. Once this process succeeds, the system OS is customized along with the mobile OS, the file transfer is performed successfully, and the rebooted system is completely free from malware and protected from future malware.
The principal aim of this work was to show the efficiency of updating antivirus tools with new, unfamiliar malware. New malware can be detected with an updated classifier, which can be used to maintain an antivirus tool. Labeled files must be kept up to date for both the antivirus tool and its detection model, the classifier. Labeling can be done manually by experts, so the aim of the classification is to focus labeling effort on files that are likely to be malware or that add new information about benign files. In this research, various machine learning classifiers were evaluated to improve malware detection on a large and robust collection of file samples and to identify the optimum classifier for detecting mobile malware.
The classifiers were Artificial Neural Network (ANN), Bayes network, decision tree (DT) (J48), K-nearest neighbor (KNN) and support vector machine (SVM). Our experiment drew on the 1,260 Android malware samples in 49 separate families provided by the MalGenome project, of which only 1000 were used. There are three phases in the machine learning process: (1) data collection, which captures network traffic; (2) feature selection and extraction; and (3) the machine learning classifier.
Malicious code detection for open firmware
Static detection of malicious code in executable programs
Computer Security: Principles And Practice
Machine Learning Methods for Malware Detection and Classification
A survey on security for mobile devices
Mobile device security: a survey on mobile device threats, vulnerabilities and their defensive mechanism
Detecting unknown malicious code by applying classification techniques on OpCode patterns
Detecting scareware by mining variable length instruction sequences
Accurate adware detection using opcode sequence extraction
Detection of spyware by mining executable files
Survey on security for mobile device: threats and vulnerability
Identification of common molecular subsequences
Malware and malware detection techniques: a survey
Droidchameleon: evaluating android anti-malware against transformation attacks
Detecting mobile malware threats to homeland security through static analysis
Rage against the virtual machine: hindering dynamic analysis of android malware
Crowdroid: behavior-based malware detection system for android
"Andromaly": a behavioral malware detection system for android devices
Static analysis of executables for collaborative malware detection on android