title: Cross-Platform File System Activity Monitoring and Forensics – A Semantic Approach
authors: Kurniawan, Kabul; Ekelhart, Andreas; Ekaputra, Fajar; Kiesling, Elmar
date: 2020-08-01
journal: ICT Systems Security and Privacy Protection
DOI: 10.1007/978-3-030-58201-2_26

Ensuring data confidentiality and integrity is a key concern for information security professionals, who typically have to obtain and integrate information from multiple sources to detect unauthorized data modifications and transmissions. The instrumentation that operating systems provide for monitoring file system level activity can yield important clues on possible data tampering and exfiltration activity, but the raw data that these tools provide is difficult to interpret, contextualize, and query. In this paper, we propose and implement an architecture for file system activity log acquisition, extraction, linking, and storage that leverages semantic techniques to tackle limitations of existing monitoring approaches in terms of integration, contextualization, and cross-platform interoperability. We illustrate the applicability of the proposed approach in both forensic and monitoring scenarios and conduct a performance evaluation in a virtual setting.

In our increasingly digitized world, Information and Communication Technologies pervade all areas of modern life. Consequently, organizations face difficult challenges in protecting the confidentiality and integrity of the data they control, and the theft of corporate information, i.e., data breaches or data leakage, has become a critical concern [7]. In the face of increasingly comprehensive collection of sensitive data, such incidents can become an existential threat that severely impacts the affected organization, e.g., in terms of reputation loss, decreased trustworthiness, and direct consequences that affect its bottom line. Fines and legal fees, either due to contractual obligations or laws and regulations (e.g., the General Data Protection Regulation in the EU), have become another critical risk. Overall, the number and size of data breaches have been on the rise in recent years.

On a technical level, exfiltration of sensitive data is often difficult to detect. In this context, we distinguish two main types of adversaries and associated threat models: (i) an insider with legitimate access to data, who either purposely or accidentally exfiltrates data, and (ii) an external attacker who obtains access illegitimately. Insiders typically have multiple channels for exfiltration at their disposal, including conventional protocols (e.g., FTP, SFTP, SSH, SCP), cloud storage services (e.g., Dropbox, OneDrive, Google Drive, WeTransfer), physical media (e.g., USB drives, laptops, mobile phones), messaging and email applications, and DNS tunneling [11]. Whereas an insider may leverage legitimate access permissions directly, or at least internal resources as a starting point, an external attacker must first infiltrate the organization's network and obtain access to the data (e.g., by spreading malware or spyware, stealing credentials, eavesdropping, brute-forcing employee passwords, etc.). State-of-the-art perimeter security solutions such as intrusion detection and prevention systems (IDS/IPS), firewalls, and network traffic anomaly detection are generally not capable of detecting insider attacks [20].
However, such activities typically leave traces in the network and on the involved systems, which can be used to spot potential misuse in real time or to reconstruct and document the sequence of events associated with an exfiltration and its scope ex post. This examination, interpretation, and reconstruction of trace evidence in the computing environment is part of digital forensics. Upon detection of security violations, forensic analysts attempt to investigate the relevant causes and effects, frequently following the hypothesis-based approach to digital forensics [6]. Although a variety of tools and techniques are employed during a digital investigation, the lack of integration and interoperability between them, as well as the heterogeneous formats of their sources and resulting data, hinders the analysis process [8].

In this paper, we introduce a novel approach that leverages semantic web technologies to address these challenges in the context of file system activity analysis. This approach can harmonize heterogeneous file and process information across operating systems and log sources. Furthermore, it provides contextualization through interlinking with relevant information and background knowledge. The research question we address in this article is: How can semantic technologies support digital file activity investigations? Addressing this question resulted in the following main contributions: (i) a set of log and file event vocabularies (Sect. 3); (ii) an architecture and prototypical implementation for file system log acquisition, event extraction, and interlinking across heterogeneous systems and with background knowledge (Sect. 4); (iii) a set of demonstration scenarios for continuous monitoring and forensic investigations (Sect. 5); and (iv) a performance evaluation in a virtual setting (Sect. 6).

Our approach builds upon and integrates multiple strands of work, which we review in the following: (i) approaches for file activity monitoring, both in the academic literature and in commercial tools; (ii) file system ontologies; and (iii) semantic file monitoring and forensics.

File Activity Monitoring. In contrast to the approach presented in this paper, prior work in this category does not involve semantic or graph-based modeling, which facilitates interoperability and integration, contextualization through interlinking with background knowledge, and reasoning. The authors in [12] focus on data exfiltration by insiders. They first apply statistical analyses to characterize legitimate file access patterns and then compare these to the access patterns of recent activities to identify anomalies. The authors note that the approach can result in a high number of suspicious activities, which can be impractical to investigate individually. The work in [4] aims to predict insider threats by monitoring various parameters such as file access activity, USB storage activity, application usage, and sessions. In their evaluation, the authors train a deep learning model on legitimate user activity and then use the model to assign threat scores to unseen activities. In [3], the authors introduce a policy-based system for data leakage detection that utilizes operating system call provenance. It facilitates real-time detection of data leakage by tracking operations performed on sensitive files.
This approach is similar to the one presented in this paper in its objectives, i.e., it also aims to monitor file activities (copy, rename, move), but it does not cover contextualization and linking to background knowledge. The work in [9] proposes an approach that leverages data provenance information from OS kernel messages to detect exfiltration of data returned to users from a database. The proposed system builds profiles of users' actions to determine whether actions are consistent with the users' tasks. While it has similar goals, its focus is limited to data exfiltration from databases via files. Apart from academic research on file activity monitoring techniques, a wide range of commercial tools is available, such as SolarWinds Server and Application Monitor, ManageEngine DataSecurity Plus, PA File Insight, STEALTHbits File Activity Monitor, and Decision File Audit. These tools cover varying scopes of leakage detection and typically provide a simple alerting mechanism upon suspicious activity. Another category of existing tools is Security Information and Event Management systems (e.g., LogDNA, Splunk, Elasticsearch). Their purpose is to manage and analyze logs; they do not specifically tackle the problem of tracking file activity life-cycles.

File System Ontologies. Ontological representation of file system information has been explored, e.g., in [18], in which the authors propose TripFS, a lightweight framework that applies Linked Data principles to file systems in order to expose their content via dereferenceable HTTP URIs. The authors model file systems with their published vocabulary, which is aligned with the NEPOMUK File Ontology (NFO). Similar to TripFS, [19] proposes VDB-FilePub to expose file systems as Linked Data and to publish user-defined content metadata. With a focus on end-user access, [17] provides an extension to TripFS that enables users to navigate the published files and to annotate and download them via common web browsers, without the need to install special software packages. In recent work, the authors of [16] proposed a Semantic File System (SFS) Ontology that extends terms from the NEPOMUK ontology. They further provide technical definitions of terms and a class hierarchy with persistent URIs and content negotiation capabilities. In our approach, we use the basic file concepts, such as file names and file properties, as proposed in the related work, but we integrate additional concepts such as file activities, source and target locations, and file classification.

Semantic File Monitoring and Forensics. The application of semantics to digital forensics has been the topic of multiple research publications. While these works are motivated by similar challenges, such as the heterogeneity, variety, and volume of data, they do not focus on file activity monitoring and life-cycle construction in particular, but on the digital evidence process in general. Early work on using semantic web technology in the context of forensics includes [13], which introduces an evidence management methodology to semantically encode why evidence is considered important. An ontology is used to describe the metadata, file contents, and events in a uniform and application-independent manner. In [1], the authors propose a similar ontology-based framework to assist investigators in analyzing digital evidence. They motivate the use of semantic technologies in general and discuss the advantages of ontological linking, annotations, and entity extraction.
A broader architecture that lifts the phases of a digital forensic investigation to a knowledge-driven setting is proposed in [8]. This results in an integrated platform for forensic investigation that deals with a variety of unstructured information (e.g., network traffic, firewall logs, and files) and builds a knowledge base that can be consulted to gain insights from previous cases via SPARQL queries. Finally, in a recent contribution [2], the authors propose a framework that supports forensic investigators during the analysis process. This framework extracts and models individual pieces of evidence, integrates and correlates them using a SWRL rule engine, and persists them in a triplestore. Compared to our approach, their focus is on text processing, while file activity analysis is not considered. The approach presented in this paper extends preliminary work published in [15] by introducing cross-platform interoperability, scenarios that demonstrate the approach, linking to background knowledge, and a performance evaluation.

Operating systems typically provide mechanisms and instrumentation to obtain information on system-level file system operations, usually at the level of kernel calls. Reconstructing the corresponding user activities, such as editing, moving, copying, or deleting a file, from these low-level signals can be challenging. In particular, the sequence of micro-operations triggered by a file system operation varies across operating systems and applications, which complicates the analysis. On Windows systems, for instance, file operations such as Create generate a number of access operations including ReadAttributes, WriteData, ObjectClosed, etc. To construct our vocabularies, we analyzed the structure, format, and access patterns of the different file activity log sources on both Windows and Linux. Furthermore, as contextualization is a key requirement for the interpretation of file activity in forensic analyses, we also include sources of (i) process activity information and (ii) authentication events (login, logout, etc.). The scenarios in Sect. 5 illustrate how we make use of process and authentication information. Due to space restrictions, we do not cover the process and authentication vocabularies in full detail and refer the interested reader to the published sources.

As the existing ontologies reviewed in Sect. 2 do not fully cover the requirements of our approach, we developed a custom ontology. We followed a bottom-up approach, starting from low-level information in the log sources, with the goal of collecting appropriate terms directly from the sources of evidence (e.g., users, hosts, files). We organize our semantic model into two levels, i.e., the log entry level and the file operation level. On the log entry level, we define a vocabulary to represent information on micro-level operations for both Windows and Linux log sources, based on a previously developed vocabulary for generic log data [10]. On the file operation level, we model a generic vocabulary to express higher-level events, i.e., actual file activities (e.g., created, modified, copied, renamed, deleted) derived from micro-level operations (Fig. 1).

Log Entry Vocabularies. The Windows Log Event (wle) vocabulary represents Windows file access events using wle:WindowsEventLogEntry, a subclass of cl:LogEntry from the SEPSES core log vocabulary.
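For illustration, a single lifted Windows log entry could be represented roughly as follows. This is a minimal, hypothetical sketch in SPARQL INSERT DATA form: the namespace IRI, the linking properties (e.g., wle:hasSubject), and all concrete values are illustrative assumptions, while the classes and attribute properties are those detailed in the remainder of this section.

  PREFIX wle: <http://example.org/vocab/windows-log-event#>   # placeholder namespace IRI

  INSERT DATA {
    # One micro-level Windows file access record lifted to RDF.
    <urn:ex:wle/entry-1001> a wle:WindowsEventLogEntry ;      # subclass of cl:LogEntry
        wle:hasSubject [ a wle:Subject ;                      # account that performed the access
                         wle:accountName "alice" ;
                         wle:logonID "0x3e7" ] ;
        wle:hasAccessRequest [ a wle:AccessRequest ;          # requested access rights
                               wle:accessMask "0x2" ;
                               wle:accesses "WriteData" ] ;
        wle:hasObject [ a wle:Object ;                        # the file object that was accessed
                        wle:objectName "C:\\share\\customer.xls" ;
                        wle:objectType "File" ;
                        wle:handleID "0x1a4" ] ;
        wle:hasProcess [ a wle:Process ] .                    # process that requested the access
  }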
The wle:Subject class represents account information such as wle:accountName and wle:logonID; the wle:AccessRequest class represents file access information such as wle:accessMask and wle:accesses; the wle:Process class represents running processes; and the wle:Object class represents file object information such as wle:objectName, wle:objectType, and wle:handleID. To cover Linux file access events, we developed the Linux Log Event (lle) vocabulary, which comprises five main classes: lle:LinuxEventLogEntry, a subclass of cl:LogEntry from the SEPSES core vocabulary; lle:Event, which covers information on file access events such as lle:eventType, lle:eventId, lle:eventCategory, and lle:eventAction; lle:File, which represents information about file objects such as lle:fileName and lle:filePath; lle:User, which covers information on the users who perform the file activities, such as lle:userName and lle:userGroup; and lle:Host, which represents lle:hostArchitecture, lle:hostOS, lle:hostName, lle:hostId, etc.

The File Operation vocabulary describes fae:FileAccessEvents by means of the following properties: fae:hasAction reflects the type of access (e.g., created, modified, copied, renamed, deleted); fae:hasUser links the file event to the user accessing the file; fae:hasProgram represents the executable used to access the file; and fae:timestamp captures the time of access. The properties fae:hasSourceFile and fae:hasTargetFile model the relation between an original and a copied instance of a file. Finally, the properties fae:hasSourceHost and fae:hasTargetHost represent the hosts where the source and target files are located.

To support contextualization and enrichment, we leverage several existing sources of internal and external background knowledge. Internal background knowledge can be developed by manually or automatically collecting an organization's persistent information (e.g., IT assets, network infrastructure, users). In our scenarios, we use predefined internal background knowledge to contextualize file access events and link them to this knowledge during event extraction. Furthermore, it is possible to leverage existing external knowledge, such as the SEPSES cybersecurity knowledge graph (CSKG), to link external information with system events.

In this section, we describe our architecture and prototypical implementation for semantic integration, monitoring, and analysis of file system activity, as depicted in Fig. 2. The Log Acquisition component deals with the acquisition of log information and is installed as an agent on clients or servers. We implement our Log Acquisition component based on Filebeat, an open-source log data acquisition tool that ships log data from a host for further processing. Using Filebeat, we can easily select, configure, and add log sources from both Windows and Linux machines. Furthermore, we use the Filebeat Audit module to ship process and authentication information from the log sources. The Log Extraction component handles the parsing of the various log data provided by the Log Acquisition component and can act as a filter that keeps only relevant parts. We use Logstash, an open-source log processing tool that provides options for developing processing pipelines to distinguish and handle different types of log sources. Furthermore, it provides different output options, such as a WebSocket output that supports data streaming.
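The remaining components, described next, lift such log data to RDF and derive file operation events from it. To preview the target representation, a derived copy event could look roughly as follows; this minimal sketch again uses SPARQL INSERT DATA form and the properties of the File Operation vocabulary above, while the namespace IRI, instance identifiers, and concrete values are invented for illustration.

  PREFIX fae: <http://example.org/vocab/file-access-event#>   # placeholder namespace IRI
  PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

  INSERT DATA {
    # A higher-level file operation event derived from a sequence of micro-operations.
    <urn:ex:fae/event-42> a fae:FileAccessEvent ;
        fae:hasAction     "copied" ;                          # created | modified | copied | renamed | deleted
        fae:timestamp     "2020-03-02T10:15:05Z"^^xsd:dateTime ;
        fae:hasUser       <urn:ex:user/alice> ;               # links into internal background knowledge
        fae:hasProgram    <urn:ex:program/scp> ;
        fae:hasSourceFile <urn:ex:file/FileServer1/customer.xls> ;
        fae:hasTargetFile <urn:ex:file/Workstation2/customer-cp.xls> ;
        fae:hasSourceHost <urn:ex:host/FileServer1> ;
        fae:hasTargetHost <urn:ex:host/Workstation2> .
  }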
The RDF-ization component transforms data into RDF by mapping the structured log data produced by the Log Extraction component to a set of predefined ontologies (cf. Sect. 3). This produces an RDF graph as the basis for file operation event extraction. We use TripleWave to publish RDF streaming data through specified mappings (e.g., RML). Furthermore, TripleWave supports the WebSocket protocol for publishing the output. The Event Extraction component generates file operation events by identifying sequences of low-level (e.g., kernel-level) file system events. Furthermore, it enriches the events by creating links between file operation events and existing internal (hosts, users, etc.) and external (e.g., the SEPSES cybersecurity knowledge graph [14]) background knowledge. We developed a Java-based event extractor and use the C-Sprite [5] engine to implement the event extraction process. C-Sprite is an RDF stream processing engine that allows us to register a set of continuous SPARQL CONSTRUCT queries against the low-level RDF graph of file system events to generate a graph of file operation events. Finally, the Data Storage, Querying, and Visualization component stores the extracted RDF graph of file operation events in persistent storage (e.g., a triplestore) and facilitates querying and further analysis. We chose the widely used Virtuoso triplestore, which provides a SPARQL endpoint, for our prototypical implementation. Furthermore, we developed a simple web-based graph visualization interface that helps analysts interpret file access lifecycles (cf. Sect. 5 for an example).

In this section, we demonstrate the feasibility of our approach by means of two application scenarios. For both scenarios, we set up a virtual lab with several Windows and Linux machines, users, groups, and shared folders.

In the first scenario, we assume that an organization has learned that confidential information was leaked. The task in this scenario is to investigate how and by whom this information has been transferred out of the organizational network. Figure 3 depicts an excerpt of the company network, including Linux and Windows workstations and a Linux file server that stores company-wide shared data as well as confidential data with restricted access permissions (e.g., customer and financial data). The organization's access model distinguishes two groups: manager and office users. Both groups are authorized to log in to the company workstations and access the internal file shares. Access to the confidential data is restricted to the manager group.

As a starting point, the analyst has the name of a file that contains the leaked sensitive information and starts to investigate its history. Listing 1.2 depicts the SPARQL query to obtain lifecycle information for this file. The result is given in Table 1 and shows that the file cstcp001.xls was accessed and modified multiple times. Inspecting the timeline, we can see that a file customer.xls was modified on FileServer1 with the IP 193.168.1.2. It was thereafter copied, renamed, and modified on the file server. Then, the file appeared on Workstation2 and was deleted from the file server. Finally, the file was renamed to cstcp001.xls and copied to another folder on Workstation2 with the name Dropbox in its file path. Figure 4 visualizes the file history. Next, the analyst wants to know how the file was transferred from FileServer1 to Workstation2.
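To answer this, the processes that ran on the involved hosts during the relevant time period can be examined. Such a query might be sketched roughly as follows; this is a hypothetical sketch in which the process-event property names (pe:), the sys: background-knowledge terms other than the exfiltration-process concept named below, and the concrete time window are assumptions rather than the actual terms of our implementation.

  PREFIX pe:  <http://example.org/vocab/process-event#>   # placeholder namespace IRIs
  PREFIX sys: <http://example.org/vocab/system#>
  PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

  SELECT ?hostName ?processName ?userName ?timestamp
  WHERE {
    # Process activity events shipped from the audit log sources.
    ?proc a pe:ProcessEvent ;
          pe:processName ?processName ;
          pe:hasUser     ?user ;
          pe:hasHost     ?host ;
          pe:timestamp   ?timestamp .
    ?user sys:userName ?userName .
    ?host sys:hostName ?hostName .

    # Restrict to programs modeled as potential exfiltration channels
    # (e.g., FTP, SCP, SSH) in the internal background knowledge;
    # the concept name follows the text below.
    ?proc pe:hasProgram ?program .
    ?program a sys:potentialExfitrationProcesses .

    # Limit the search to the time period of the suspicious file activities.
    FILTER (?timestamp >= "2020-03-02T10:00:00Z"^^xsd:dateTime &&
            ?timestamp <= "2020-03-02T11:00:00Z"^^xsd:dateTime)
  }
  ORDER BY ?timestamp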
A SPARQL query along these lines lists the running processes and user names in the time period of the suspicious activities. Potential exfiltration processes are modeled in the background knowledge with the concept sys:potentialExfitrationProcesses, which includes channels such as FTP, SCP, SSH, etc. This illustrates how queries can automatically make use of modeled background knowledge. Table 2 shows the results of the query. From this, the analyst learns that a secure copy process (/usr/bin/scp) was started on FileServer1 prior to the file copy, and also on the Windows host Workstation2. The processes on the file server were performed by the user Alice from the manager group. The analyst concludes that the customer-cp.xls file was successfully transferred via SCP (SSH service) by the user Alice.

Next, the analyst wants to collect more information about this file transfer and the users involved in those steps. Therefore, a LoginProcess query is executed to retrieve a list of users logged in to these hosts in the time period of interest, including userName, sourceIp, targetIp, hostName, and the timestamp. The query result depicted in Table 3 shows that Alice was not logged in to Workstation1 during this time. Instead, Bob shows up several times in the login list of Workstation1. From Workstation1, a login event was performed on FileServer1 with Alice's credentials. At the time the file copy to the Dropbox folder happened on Workstation2, only Bob was logged in to this computer. Concluding from this evidence, the analyst suspects that Bob logged in to Workstation1 and then accessed the confidential file on FileServer1 with Alice's credentials. Finally, he copied the file to Workstation2 and exfiltrated the data via Dropbox.

In the second scenario, we illustrate how the semantic monitoring approach can be used to protect confidential information by combining public vulnerability information with file activity information from inside the company network. We assume a policy that restricts the handling of confidential files on hosts with known vulnerabilities. The objective in this scenario is to automatically detect violations of this policy. More precisely, the goal is to spot whenever files flagged as confidential are copied or created on an internal host with a known vulnerability. As background knowledge, we import information on the software installed on each host. This information is represented in the Common Platform Enumeration (CPE) format and can be collected automatically by means of software inventory tools. To link this information to known vulnerabilities, we rely on Common Vulnerabilities and Exposures (CVE), a well-established enumeration of publicly known cybersecurity vulnerabilities. We take advantage of our recent work on transforming this structured knowledge into a knowledge graph [14], which is available via various semantic endpoints. This allows us to directly integrate this information and use it in our scenario.

To implement the monitoring in this scenario, we set up the federated continuous SPARQL query shown in Listing 1.2 to identify whether a sensitive file shows up on a vulnerable workstation. To restrict the query to confidential files, we use the property asset:hasDataClassification and limit our query to sys:Private files. Table 4 shows the query results and reveals that Workstation2 and Workstation3 have critical vulnerabilities but store confidential files. The results include the fileName, hostName, hostIP, cveId, etc.
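The core pattern of this federated query can be sketched roughly as follows; apart from asset:hasDataClassification, sys:Private, and the File Operation vocabulary, the property names, namespace IRIs, and the endpoint URL are placeholders and assumptions, and in the running system the pattern is registered as a continuous query rather than executed as a one-off SELECT.

  PREFIX fae:   <http://example.org/vocab/file-access-event#>   # placeholder namespace IRIs
  PREFIX asset: <http://example.org/vocab/asset#>
  PREFIX sys:   <http://example.org/vocab/system#>
  PREFIX cve:   <http://example.org/vocab/cve#>

  SELECT ?fileName ?hostName ?hostIP ?cveId
  WHERE {
    # File operation events that create or copy a file on an internal host.
    ?event a fae:FileAccessEvent ;
           fae:hasAction     ?action ;
           fae:hasTargetFile ?file ;
           fae:hasTargetHost ?host .
    FILTER (?action IN ("created", "copied"))

    # Only files classified as confidential in the internal background knowledge.
    ?file asset:hasDataClassification sys:Private ;
          sys:fileName ?fileName .

    # Host details and installed software (CPE identifiers) from background knowledge.
    ?host sys:hostName ?hostName ;
          sys:hostIP   ?hostIP ;
          sys:hasInstalledSoftware ?cpe .

    # Federated lookup of known vulnerabilities (CVEs) affecting that software
    # in the SEPSES cybersecurity knowledge graph (placeholder endpoint URL).
    SERVICE <https://example.org/sepses/sparql> {
      ?vuln cve:hasCPE ?cpe ;
            cve:id     ?cveId .
    }
  }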
As a next step, an analyst can inspect the life-cycle of the files to understand where they came from and who accessed them, and explore information on the vulnerabilities and potential mitigations. Taking automated actions based on the results, such as blocking access or alerting the user, is a further option.

In this section, we present our empirical evaluation setup and discuss the results. We ran the experiments on an Intel Core i7 processor with 2.70 GHz, 16 GB RAM, and 64-bit Microsoft Windows 10 Professional, and emulated hosts as Docker containers. We used C-Sprite as the event extraction engine with a 3-second time window that slides every second. To simulate user activity, we developed a Java-based event generator that generates scripts for random file activities and uses weighted random choices to select activities. To measure the correctness and completeness of event extraction and detection using RDF stream processing with C-Sprite, we define a set of metrics: (i) Actual Events (AE), the number of events executed in the simulation (ground truth), and (ii) Returned Events (RE), the number of events correctly detected by the RDF stream processing engine (C-Sprite). We obtain the detection rate (%D) by dividing RE by AE. On each target OS (Linux and Windows), we test a varying number of events per second, i.e., 1, 10, 20, 50, 80, 100, 125, and 200 events/sec. In the results, we report the mean of detected events over 5 runs with 480 simulated events each.

As shown for Linux in Fig. 5, all events are detected close to 100% for all frequencies (1 event/sec up to 200 events/sec) except the copy event, which reached a maximum of 91.89%. At 200 events/sec, we observe that the detection of copy events decreases to approx. 70%, which is mainly caused by incorrect pairings of readAttribute and create events when these micro-operations, generated by two or more sequential copy events, appear together in the same window. Furthermore, we noticed that low-level events sometimes do not arrive in sequence and hence are not detected by our queries. For Windows, the event detection performance for created, modified, renamed, and deleted events is higher, with almost 100% of events detected for all frequencies. However, copy event detection on Windows achieves a lower detection rate, with a maximum of 75.46%. Finally, considering scalability, we can make an estimate based on [5], which shows that C-Sprite achieves a throughput of more than 300,000 triples/s. Consequently, it should be able to handle up to 23,000 events/s (an individual event consists of at least 13 triples). For forensic scenarios, the Virtuoso triplestore can load more than 500 million triples per 16 GB RAM, which means that it should be possible to handle more than 38 million events per 16 GB RAM.

In this paper, we tackled current challenges in file activity monitoring and analysis, such as the lack of interoperability, contextualization, and uniform querying capabilities, by means of an architecture based on Semantic Web technologies. We introduced a set of vocabularies to model and harmonize heterogeneous file activity log sources and implemented a prototype. We illustrated how this prototype can monitor file system activities, trace file life cycles, and enrich them with information to understand their context (e.g., internal and external background knowledge).
The integrated data can then be queried, visualized, and dynamically explored by security analysts, as well as used to facilitate detection and alerting via stream processing engines. Finally, we demonstrated the applicability of the approach in two scenarios in virtual environments: one focused on data exfiltration forensics, the other on monitoring policy violations by integrating public vulnerability information. The results of our evaluation indicate that the approach can effectively extract and link micro-level operations from multiple operating systems and consolidate them in an integrated stream of semantically explicit file activities. Overall, the results are promising and demonstrate how semantic technologies can enrich digital investigations and security monitoring processes.

In future work, we aim to address the accuracy and scalability limitations of the current approach identified in the streaming evaluation, e.g., by evaluating alternative streaming engines and alternative approaches (e.g., complex event processing) based on big data technologies. Furthermore, we will investigate the integration of our approach with existing standards (e.g., STIX and CASE) to increase interoperability for forensic investigations.

References
1. An ontology-based forensic analysis tool
2. An application of semantic techniques for forensic analysis
3. Data leakage detection using system call provenance
4. Predicting insider threats by behavioural analysis using deep learning
5. C-Sprite: efficient hierarchical reasoning for rapid RDF stream processing
6. A hypothesis-based approach to digital forensic investigations
7. Enterprise data breach: causes, challenges, prevention, and future directions
8. A semantic-web-technology-based framework for supporting knowledge-driven digital forensics
9. PANDDE: provenance-based anomaly detection of data exfiltration
10. Taming the logs - vocabularies for semantic security analysis
11. Data leakage - threats and mitigation
12. Profiling file repository access patterns for identifying data exfiltration activities
13. Semantic modelling of digital forensic evidence
14. The SEPSES knowledge graph: an integrated resource for cybersecurity
15. Semantic integration and monitoring of file system activity
16. The design and development of a semantic file system ontology
17. Ad-hoc file sharing using linked data technologies
18. Lifting file systems into the linked data cloud with TripFS
19. Publishing distributed files as linked data
20. An integrated data exfiltration monitoring tool for a large organization with highly confidential data source

Acknowledgments. This work was sponsored by the Austrian Science Fund (FWF) and netidee SCIENCE under grant P30437-N31, and by the COMET K1 program of the Austrian Research Promotion Agency. The authors thank the funders for their generous support.