title: Student assessment in cybersecurity training automated by pattern mining and clustering
authors: Švábenský, Valdemar; Vykopal, Jan; Čeleda, Pavel; Tkáčik, Kristián; Popovič, Daniel
date: 2022-03-30 journal: Educ Inf Technol (Dordr) DOI: 10.1007/s10639-022-10954-4

Hands-on cybersecurity training allows students and professionals to practice various tools and improve their technical skills. The training occurs in an interactive learning environment that enables completing sophisticated tasks in full-fledged operating systems, networks, and applications. During the training, the learning environment allows collecting data about trainees' interactions with the environment, such as their usage of command-line tools. These data contain patterns indicative of trainees' learning processes, and revealing them makes it possible to assess the trainees and provide feedback to help them learn. However, automated analysis of these data is challenging. The training tasks feature complex problem-solving, and many different solution approaches are possible. Moreover, the trainees generate vast amounts of interaction data. This paper explores a dataset from 18 cybersecurity training sessions using data mining and machine learning techniques. We employed pattern mining and clustering to analyze 8834 commands collected from 113 trainees, revealing their typical behavior, mistakes, solution strategies, and difficult training stages. Pattern mining proved suitable for capturing timing information and tool usage frequency. Clustering underlined that many trainees often face the same issues, which can be addressed by targeted scaffolding. Our results show that data mining methods are suitable for analyzing cybersecurity training data. Educational researchers and practitioners can apply these methods in their contexts to assess trainees, support them, and improve the training design. Artifacts associated with this research are publicly available.

From these data, the instructor can see that each student needs help with a specific and different aspect of the training.

Why is assessment difficult? In-depth assessment of cybersecurity training is difficult for four main reasons.
1. The training is complex. The tasks require higher-order problem solving and may have many different correct solutions. Therefore, the assessment is much more complex than assessing simple tasks such as memorizing facts.
2. Each student is unique. Every student has different previous knowledge, experience, motivation, and approach to learning. As a result, students adopt different strategies to solve the tasks. This is natural, but it further complicates the conditions for automatically assessing hands-on tasks.
3. Students generate a lot of data. During the training, even a class that is relatively small (10-20 students) and time-constrained (1-2 hours) can generate hundreds of data records. As a result, manually processing these data quickly becomes infeasible.
4. The assessment process is not straightforward. It is unclear how to transform the raw data from training into educational insights useful for assessment.

Fig. 1: The first student ran the cracking tool 24 times within an approximately 5-minute time frame, with various combinations of arguments, often repeating the previous (incorrect) combinations. After that, the student stopped for 4 minutes, probably to find help, and then executed a correct command.
Fig. 2: The second student assumed that the tool had the .exe suffix of Windows OS executables, which does not apply to Linux OS. The student was apparently unfamiliar with Linux or the cracking tool but then instantly executed a correct command without any previous incorrect tries. We can assume that they received outside help.

As the examples in Figs. 1 and 2 demonstrated, even a relatively constrained assignment can generate varied data for assessment.

The need for research Traditionally, educational researchers and practitioners assessed student data manually. However, due to the difficulties described above, a manual transformation of hands-on training data into educational insights is not viable (Fournier-Viger, 2017; Romero & Ventura, 2020). It is highly time-consuming, ineffective, and error-prone. Automated assessment is more scalable and accurate. Therefore, it can be fruitful to leverage automated techniques, such as machine learning and data mining, for analyzing data from hands-on training (Palmer, 2019). These techniques should transform the data from their raw form into an understandable representation, such as an overview of highlights or a visualization. However, our review of the current literature (see Section 3 for details) identified several gaps in the state of the art in this area:
• As Weiss et al. (2016) argued, current automated assessment is often superficial, judging only the (in)correctness of the solution. Only a few papers, such as by Mirkovic et al. (2020), have explored an in-depth assessment of student learning.
• To the best of our knowledge, no published research has attempted to compare and evaluate the applicability of two different data mining methods on cybersecurity training data. Student assessment in cybersecurity has been explored from other perspectives, such as using numerical scoring metrics (see Maennel et al. (2017) for an example).
• Data mining algorithms have been used for assessment in other domains, such as programming (Gao et al., 2021), but it is unclear how to generalize these previous results to the cybersecurity context.
We seek to support automated assessment of students in hands-on training. In order to address the gaps in the literature, the assessment must satisfy the following criteria:
• enable an in-depth understanding of students' actions,
• use methods that have not been researched in this context previously, and
• be evaluated on an authentic dataset from realistic training sessions.
The domain of data mining offers many methods for the automated extraction of insights from raw data (Fournier-Viger, 2017). Two methods that satisfy the criteria above and will be explored in this paper are pattern mining and clustering. Pattern mining techniques, such as association rule mining and sequential pattern mining, can reveal interesting relationships in datasets (Fournier-Viger, 2013b). Clustering, on the other hand, forms groups of data based on their similar characteristics. Evaluating these two techniques represents an original contribution to cybersecurity education and beyond.

Research questions Our research is framed by two research questions related to student assessment in cybersecurity: What insights can we gather from command histories using pattern mining (RQ1) and clustering (RQ2)?
By insights, we mean the following educational findings to support assessment:
• trainees' approaches and strategies to solving the training tasks,
• common mistakes, misconceptions, and tools problematic for trainees,
• distinct types of trainees based on their actions and behavior, and
• issues in the training design and execution.
Expected contributions of this research Answering the research questions will be valuable for various stakeholders.
• Cybersecurity instructors can use the researched methods in their classes to gain new insights for assessing their students. Specific assessment use cases are detailed in Sections 5.3 and 5.5.
• Researchers can build upon this work by evaluating other data mining methods on similar datasets. This will contribute to the body of knowledge on assessment in cybersecurity training.
• Developers of cybersecurity training platforms can integrate the researched methods of data collection and analysis into the interactive learning environments. This will support the goals of instructors and researchers.
Educational stakeholders from outside the cybersecurity domain can benefit from this research as well. Students of related computing disciplines, such as networking and operating systems administration, can generate similar data for assessment in hands-on classes. For students of other disciplines, the researched methods can be extended to process different data, such as clickstreams.

Above, we defined three target groups who may be interested in this paper. Although we aim to address readers from a broad audience, we acknowledge that some sections of the paper are not relevant for everyone. Section 2 provides a brief background and therefore aims at researchers who seek to understand the theory of the used methods. Other readers who are satisfied with a more high-level understanding may skip it. Section 3 reviews related studies, which is relevant for researchers and instructors interested in how previous research results were applied to support teaching practice. Section 4 details the used methods for the data collection and analysis. It is aimed mainly at researchers and developers, since it also includes technical details about the training platforms and data collection. Section 5 presents the findings and answers the research questions. Finally, Section 6 concludes, summarizes our contributions, and proposes future work. These two sections are suitable for all readers.

This section defines the key terms to familiarize the readers with basic data mining concepts. Data mining is a field of computing that deals with extracting knowledge from data. Its purpose is to enable understanding of the data, gather new insights from them, and support decision-making based on this understanding (Han et al., 2011). Out of the many data mining methods, we will focus on two of them: pattern mining (Section 2.2) and clustering (Section 2.3). Educational data mining (EDM) and learning analytics (LA) (Lang et al., 2017) are two inter-related research areas that aim to understand and improve teaching and learning. The research in these areas focuses, for example, on student behavior, learning processes, assessment, and interactive learning environments. To achieve their aims, EDM/LA researchers collect and analyze data from educational settings.

Pattern mining automatically extracts previously hidden patterns in data. Its objective is to discover patterns that are easily interpretable by humans.
We concentrate on two well-established pattern mining techniques: association rule mining (ARM) and sequential pattern mining (SPM) (Fournier-Viger, 2013b; Fournier-Viger et al., 2017).

Association rule mining Association rules are patterns with the form of an if-then statement. A rule X → Y says that if an item X occurs in a transaction (a set of items), then so does Y (Han et al., 2011; Romero et al., 2010). In our case, an item may be a command submitted by a student, and a transaction may be the whole set of commands of that student. An association rule mined from a set of students' transactions may indicate that if a student used a command X, then they also used a command Y. For each association rule X → Y, we are typically interested in two metrics: its support (relative occurrence among all the examined transactions) and confidence (relative occurrence among the transactions that contain X). Algorithms for mining association rules consider only rules that satisfy the user-defined thresholds for the minimal support and confidence, MinSup and MinConf. Since this process can extract a vast number of rules, additional measures such as lift are applied to filter out irrelevant rules (Han et al., 2011; Romero et al., 2010).

Sequential pattern mining A sequential pattern is a frequently occurring subsequence in a given set of sequences (Romero et al., 2010). For example, it can be a progression of certain commands that many students used. In contrast to ARM, SPM can analyze data in which the ordering of items is relevant. Again, sequential patterns are mined based on a MinSup threshold. To find a manageable number of patterns, it is recommended to use algorithms that mine closed sequential patterns (Fournier-Viger et al., 2014; Fumarola et al., 2016).

Clustering is the process of assigning data points into groups called clusters based on their similarity. Data in one group are similar to each other and dissimilar to data from other groups (Madhulatha, 2012). For example, in our context, we can group students based on the similarities in their command-line usage. Clustering is an unsupervised machine learning technique, so it does not use previously labeled data to assess new data. Instead, it organizes unlabeled data into "bundles". We focus on density-based clustering, which defines a cluster as an area with a high density of data points; low-density areas separate individual clusters. Unlike partitional clustering methods, such as the popular k-means clustering (Lloyd, 1982), density-based approaches are better at recognizing arbitrarily shaped clusters and filtering noise or outliers. However, not all data points may end up in a cluster (Beyer et al., 1999; Aggarwal et al., 2001).

This section reviews the publications related to the analysis of educational data. It also explains how our research differs from the state of the art. Association rule mining (ARM) or sequential pattern mining (SPM) has been employed to investigate various aspects of education. These include learner difficulties, correlations between learning behaviors and performance, and teaching strategies that lead to better learning (Romero and Ventura, 2020; Bienkowski et al., 2012). García et al. (2010) applied ARM to data capturing students' usage of a learning management system, discovering relationships between students' activities and final grades. Instructors can use this information to adjust the course or identify struggling students early.
Kobayashi (2014) also used ARM to uncover the errors that frequently co-occurred at various proficiency levels when learning spoken English. The pattern mining revealed types of mistakes that distinguish lower-level and upper-level students. Malekian et al. (2020) applied SPM to data representing students' actions and task submissions in an online learning environment. The researchers wanted to discover the behavior patterns that lead to successful or unsuccessful assessment outcomes. Therefore, they split the sequences of actions into two categories depending on the outcome of the sequence's final submission. The failed sequences contained mainly repeated assessment submissions and discussion forum views. In contrast, the passed sequences included multiple reviews of lecture materials. This information can be used to modify the learning environment to discourage unproductive behavior. Gao et al. (2021) mined sequential patterns from programming logs to identify struggling students. Recognizing these students in a timely manner is essential for promoting their learning. To establish ground truth, the researchers again split the logs of high- and low-performing students. Then, they mined patterns that either dominated in one group to discover its specifics, or occurred in both groups to reveal similarities. After that, they used the patterns as features in a classifier algorithm to predict student performance.

Vellido et al. (2010) motivate the usage of clustering in educational contexts. In addition, they also provide a brief overview of literature where clustering was applied to solve educational problems. Next, Romero and Ventura (2010) and Dutt et al. (2017) performed literature reviews of EDM papers. Clustering has been used to provide feedback to instructors, detect undesirable or unusual student behavior, analyze and model student behavior, and group students by various characteristics, such as their learning approaches. Yin et al. (2015) used the OPTICS algorithm to cluster students' programming assignments, aiming to support autograding based on the type of solution. Student source code was represented as an abstract syntax tree, with the normalized tree edit distance as the similarity measure for clustering. The researchers discovered clusters corresponding to distinct types of solutions (canonical, correct but longer code, complex solution, and so on). McBroom et al. (2016) mined submission logs from an autograding system for program code. They clustered weekly submissions to find approaches to each assignment while also analyzing the long-term behavior to learn how students develop. The researchers detected common behavioral patterns as early as week three of the semester, and students' behavior largely remained the same afterward. Teachers can use the gained insight to intervene when a student belongs to a cluster with a higher risk of failure. The goal of Piech et al. (2012) was to study how students learn to program. To do so, the researchers captured and clustered temporal traces of student interactions with a compiler. They applied a hidden Markov model to the temporal traces and visualized it as a state machine for each cluster. The model then predicted student performance. Emerson et al. (2020) explored novices' misconceptions in block-based programming. The researchers used logs of unsuccessful student attempts at programming assignments.
The students' programs were represented by three families of features: basic block features, counts of specific block sequences, and the number of interactions with the system. The results revealed three clusters of students: exploratory, disorganized, and near-miss. In their follow-up work, Wiggins et al. (2021) analyzed novices' hint requests in block-based programming. When a student asked for a hint, the time elapsed from the assignment's start and the percentage of code completion were recorded. Clustering of these data revealed five different groups of students based on their hint-taking strategies. For example, those that asked for a hint early and had low code completeness probably needed a "push" to start. Instructors can use this information to target the students' needs specific to the given group.

Maennel (2020) performed a thorough literature review of data sources that can serve as evidence of learning in cybersecurity exercises. These data sources include timing information, command-line data, counts of events, and input logs. Our paper investigates the applicability of command-line data in educational assessment. Such data are collected in multiple state-of-the-art learning environments for cybersecurity training (Weiss et al., 2017; Andreolini et al., 2019; Labuschagne and Grobler, 2017; Tian et al., 2018). Weiss et al. demonstrated that command-line data from cybersecurity training are valuable for student assessment. They incorporated information about the students' exact steps, rather than just a numerical score indicating success or failure. They analyzed the students' work processes and the utilized command-line tools. Based on the command histories, they generated progress models of student approaches (Weiss et al., 2016; Weiss et al., 2017) and predicted their success (Vinlove et al., 2020). Mirkovic et al. (2020) collected and analyzed command-line input and output from participants in hands-on cybersecurity exercises. The analysis system automatically compared the collected data with pre-defined exercise milestones and produced statistics about the participants' progress. It helped identify difficult sections of the exercises and students needing assistance, providing useful information to instructors. Abbott et al. (2015) parsed a dataset of logs from cybersecurity training into meaningful blocks of activity and statistically analyzed them. McClain et al. (2015) further explored this dataset combined with questionnaires measuring the participants' experience in cybersecurity. They discovered that more experienced participants used both specialized and general-purpose tools, while the less experienced participants focused only on specialized cybersecurity tools. Finally, several works investigated the assessment of teams in sophisticated cyber defense exercises. Granåsen and Andersson (2016) collected network and system logs to study the performance of teams. Similar data sources were used by Henshel et al. (2016) to assess and predict team performance. Maennel et al. (2017) proposed a systematic approach: a methodology to employ exercise data for team assessment. In contrast, we focus on individual assessment during exercises in the scope of classroom teaching.

Pattern mining and clustering have been applied in educational contexts with interesting results. They can reveal students' misconceptions, approaches to solving the tasks, and behavioral patterns. These insights can improve educational assessment and feedback and target instruction to support students' needs.
The novelty of our paper is exploring these methods in the context of cybersecurity training. Previously, command-line data from cybersecurity training were analyzed using other methods, such as statistics, regular expression matching, and classifiers. We seek to discover insights gathered from cybersecurity training data using pattern mining and clustering, as well as demonstrate their usefulness for assessment. Moreover, we aim to uncover in-depth insights, not only assess the correctness of the student solution.

This section explains the methods chosen to answer the research questions posed in Section 1.2. A visual overview of these methods is provided in Fig. 3. In previous projects (Tkáčik, 2020; Popovič, 2021), we prototyped the methods on smaller datasets, yielding initial results that we updated for this paper.

Fig. 3: The command logs collected from students act as input for pattern mining and clustering. The results are visualized and interpreted in Section 5.

Our research analyzes data from cybersecurity training. Specifically, we focus on offensive security skills training in a sandboxed network emulated within an interactive learning environment. The following text introduces essential aspects of the training to provide context for the research.

Interactive learning environment The virtual machines for the training were hosted in the KYPO Cyber Range Platform (Masaryk University, 2021; Vykopal et al., 2021), which is a cloud-based infrastructure for emulating complex networks. For some training sessions, we alternatively used Cyber Sandbox Creator (Masaryk University, 2022a; Vykopal et al., 2021): a tool for creating lightweight virtual labs hosted locally on the trainees' computers. This choice of the underlying infrastructure did not affect the training content, and the data collection was also equivalent. Both platforms are open-source, and cybersecurity instructors can freely deploy them for their purposes.

Training format The trainees worked with the interactive learning environment either remotely via a web browser or locally on their computers. Each trainee accessed their own isolated sandbox containing a virtual machine with Kali Linux (Offensive Security, 2022a): an operating system distribution tailored for penetration testing that provided the necessary tools. The trainees completed a sequence of assignments presented via a web interface. Almost all the assignments were solved using command-line tools, which are described below. The participants were allowed to use any sources on the Internet. Moreover, the interactive learning environment offered optional hints, which the trainees could reveal to get help with the current task. The usage of hints and outside help was allowed since the trainees were not evaluated summatively (that is, the training was not a graded exam). Instead, we focused on formative assessment and helping the students explore new cybersecurity skills.

Training content Each trainee participated in exactly one of two types of training. Both trainings involved attacking an intentionally vulnerable virtual host using well-known security tools, but the trainings slightly differed in their content. In Training A (72 participants), the following tools were crucial: nmap for network scanning, Metasploit for exploitation, john for password cracking, and ssh for remote connection. Training B (41 participants) used nmap and ssh as well, but not Metasploit or john. Instead, it featured fcrackzip for cracking passwords to ZIP files (see Figs. 1 and 2).
None of the trainees was previously familiar with either of these two trainings. Again, the training content is publicly available (Masaryk University, 2022b). Training A corresponds to the cybersecurity game Secret laboratory and its derivatives, while Training B corresponds to the game Junior hacker training. Cybersecurity instructors can freely deploy these games in their classes and recreate the conditions for our research.

Training participants From August 2019 to February 2021, we hosted 18 cybersecurity training sessions for a total of 113 trainees. Each training session usually took two hours to complete, and most of them were held remotely due to COVID-19 restrictions. The participants included:
• undergraduate and graduate students of computer science from various European universities,
• high school students attending the national cybersecurity competition, and
• cybersecurity professionals.
They all attended voluntarily because of their interest in cybersecurity and were not incentivized. Although the participants do not form a random sample, we argue that it is practically infeasible to recruit a randomized population for this type of research. Therefore, we instead worked with representatives of the target group for this cybersecurity training.

Ethical and privacy-preserving measures for research Since we carried out research with human participants, we ensured that the trainees would not be harmed in any way. We minimized the extent of data collection to gather only the data necessary for the research. We also received a waiver from our institutional ethical board since we did not collect any personally identifiable information. The participants provided informed consent to the collection and usage of their data for research purposes. The collected data were thoroughly anonymized so as not to reveal the trainees' identities. As a result, it is impossible to track a trainee throughout future training sessions.

While the trainees solve the assignments, our infrastructure (Švábenský et al., 2021) automatically collects their submitted commands and the associated metadata. We gathered data from command-line tools in the Linux Bash terminal and the Metasploit shell, which is software for penetration testing (Offensive Security, 2022b). These data, which are published openly (along with other training data) (Švábenský et al., 2021), serve as the input for pattern mining and clustering. We did not collect data from tools with a graphical user interface. The command history of each trainee is captured in a single JSON file. The file consists of dozens of log records (78 per trainee on average), such that each record represents a single command executed by the trainee. Figure 4 shows an example of such a log record. Each log record has a fixed number of attributes. For our purposes, the most significant are:
• timestamp, representing the time of the command's execution in the ISO 8601 format,
• cmd, which represents the full command (the tool and its arguments) submitted by the trainee, and
• cmd_type, the application used to execute the command: either "bash-command" for the tools executed within the Linux Bash terminal, or "msf-command" for the Metasploit shell.
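For illustration, a log record with these three attributes might look like the following snippet. The values are hypothetical and only mirror the structure described above; the actual records in the published dataset may contain additional metadata attributes not shown here.

```json
{
  "timestamp": "2021-02-15T14:32:08+00:00",
  "cmd": "nmap --help",
  "cmd_type": "bash-command"
}
```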
We collected 8834 commands, which constitute the dataset for this research, over a period of 1.5 years. Although this sample is not massive in volume, it captures the trainees' interactions deeply and over prolonged periods. Therefore, it fulfills the prerequisites of the chosen data mining methods. Hands-on cybersecurity training is usually held in groups of a few dozen participants. Therefore, we consider the 8834 commands sufficient for evaluating the two data mining methods. On average, this dataset corresponds to 78 commands per trainee within the 1-2-hour time frame, which is appropriate for the chosen training format. For this research paper, we focus on data processing after the training ends. Nevertheless, the used methods are applicable during the training for real-time assessment as well.

To enable mining patterns from the command-line data, our analysis scripts written in Python automatically transformed the input data into the transaction and sequence databases described below. These databases are an internal representation of the input data, and they serve as the input for ARM and SPM algorithms, respectively. A key advantage of pattern mining is that the data preparation is the same for assessing any task from the training.

Transaction databases We parsed the dataset of commands to create two transaction databases used as input for ARM. The command transaction database represents each submitted command as a separate transaction, and its goal is to reveal different properties of command usage. Each transaction contains four items that represent the attributes of the command:
• tool, the name of the submitted command (e.g., nmap or ssh),
• args, the command-line arguments supplied to the tool,
• app, either Bash shell (Linux terminal) or Metasploit,
• gap, the time difference between the current and the following command.
For example, the command from Fig. 4 can become a single transaction {tool = nmap, args = --help, app = bash, gap = low}. To achieve better interpretability, the gap attribute was automatically discretized: divided into categorical classes from the set {low, medium, high, undefined}, since the exact value in seconds is not too important. We followed the method previously published by McCall and Kölling (2019). First, the gap value in seconds was computed for each command. Then, gaps exceeding the arbitrary maximum of 20 minutes were discretized to "undefined". This resolved the cases of long periods of trainee inactivity. The interval cut-off points for the "low", "medium", and "high" categories were computed based on the mean gap from all gaps not exceeding the maximum. The second database, called the tool transaction database, contains transactions with only two attributes: tool and gap. We merged consecutive uses of the same tool (regardless of the arguments) into a single transaction. The gap represents the time difference between the first use of a tool and the next use of a different tool; the values were discretized as before. The motivation for creating this database was to determine the difficulty of using different tools. If a tool is associated with long gaps, it may indicate that the trainees were unfamiliar with this tool and had difficulties using it.
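To make the data preparation concrete, the following minimal Python sketch illustrates how one trainee's command log could be turned into the command transaction database, including the gap discretization. It is our illustration, not the authors' published code: the function names, the assumption that the JSON file holds a list of records, and the exact cut-off factors are ours; the paper derives the cut-offs from the mean gap following McCall and Kölling (2019).

```python
import json
from datetime import datetime

MAX_GAP = 20 * 60  # gaps longer than 20 minutes are treated as "undefined"

def load_commands(path):
    """Load one trainee's command history (assumed to be a JSON list of records)."""
    with open(path) as f:
        records = json.load(f)
    return sorted(records, key=lambda r: r["timestamp"])

def gap_seconds(current, following):
    """Time difference in seconds between the current and the following command."""
    t1 = datetime.fromisoformat(current["timestamp"])
    t2 = datetime.fromisoformat(following["timestamp"])
    return (t2 - t1).total_seconds()

def discretize(gap, mean_gap):
    """Map a gap in seconds to a categorical class (illustrative cut-off factors)."""
    if gap is None or gap > MAX_GAP:
        return "undefined"
    if gap < 0.5 * mean_gap:
        return "low"
    if gap < 2.0 * mean_gap:
        return "medium"
    return "high"

def to_command_transactions(records):
    """Build the command transaction database: one four-item transaction per command."""
    gaps = [gap_seconds(a, b) for a, b in zip(records, records[1:])] + [None]
    valid = [g for g in gaps if g is not None and g <= MAX_GAP]
    mean_gap = sum(valid) / len(valid) if valid else 0.0
    transactions = []
    for record, gap in zip(records, gaps):
        tool, _, args = record["cmd"].partition(" ")
        transactions.append({
            "tool": tool,
            "args": args,
            "app": "metasploit" if record["cmd_type"] == "msf-command" else "bash",
            "gap": discretize(gap, mean_gap),
        })
    return transactions
```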
Sequence databases Three sequence databases were created as input for SPM. All three had 113 sequences (corresponding to the number of trainees and the command log files), differing only in the contained items. The first database, called the command sequence database, consists of sequences of executed commands. Each item represents a single command, both the tool and its arguments. For example, a sequence from this database can look like this: nmap --help, nmap 1.2.3.4, nmap -p 1000 1.2.3.4. The second database, the tool sequence database, contains sequences of tools only. Data from both the Bash and Metasploit applications are included in the first two databases. This allows discovering longer patterns, which more accurately reflect the trainees' progress. The third database, the application sequence database, stores sequences of applications utilized by the trainees to execute commands. Its goal is to reveal a high-level overview of alternating between applications. This database contains only two unique items: terminal, which includes all the commands executed in the Bash shell, and metasploit. Table 1 shows the number of transactions/sequences and unique items in each of our databases.

Association rule mining For ARM, we used Apyori (Mochizuki, 2019), a Python implementation of the Apriori algorithm. The MinSup threshold was manually tuned for each database since there is no simple method to determine it. The threshold was initially set to higher values and then gradually lowered to 0.01-0.04 until we reached a number of patterns manageable for interpretation. This approach is suggested by Fournier-Viger (2013a) since finding suitable values depends on the data and the specific use case. The MinConf threshold is generally easier to set, because the database's properties influence MinSup more heavily than MinConf (Fournier-Viger et al., 2012). Since we were interested in rules with higher confidence, we used a higher MinConf threshold of 0.5. In contrast, MinSup needed to be much lower to extract a sufficient number of rules. This was probably because our transaction databases contained many unique items relative to the total number of transactions. If there were fewer unique items, MinSup could have been increased.

Sequential pattern mining For SPM, we used the open-source data mining library SPMF (Fournier-Viger et al., 2016). It provides optimized and documented implementations of more than 190 data mining algorithms (Fournier-Viger, 2021b), which are often used as benchmarks in research papers (Fournier-Viger et al., 2016). We selected CloFast (Fumarola et al., 2016), an efficient algorithm for mining closed sequential patterns. The MinSup threshold was experimentally set from 0.3 to 0.7.
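As an illustration of the ARM step, the sketch below shows how association rules could be mined from the command transaction database with Apyori. It is a minimal example rather than the authors' actual script: the encoding of each transaction as "attribute=value" items and the exact threshold values are our assumptions.

```python
from apyori import apriori  # pip install apyori

# Each command transaction is encoded as a list of "attribute=value" items,
# e.g., the four items of the transaction derived from Fig. 4:
transactions = [
    ["tool=nmap", "args=--help", "app=bash", "gap=low"],
    ["tool=nmap", "args=-p 1000 1.2.3.4", "app=bash", "gap=high"],
    ["tool=ls", "args=", "app=bash", "gap=low"],
    # ... one list per command of every trainee
]

# Low MinSup and MinConf = 0.5, as described above.
rules = apriori(transactions, min_support=0.02, min_confidence=0.5)

for relation in rules:
    for stat in relation.ordered_statistics:
        print(set(stat.items_base), "->", set(stat.items_add),
              f"support={relation.support:.2f}",
              f"confidence={stat.confidence:.2f}",
              f"lift={stat.lift:.2f}")
```

The SPM step is analogous but relies on the SPMF library, which is a Java tool run outside Python, so it is not sketched here.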
A popular density-based algorithm is OPTICS (Ordering Points To Identify the Clustering Structure) (Ankerst et al., 1999), an improved extension of the widely used DBSCAN algorithm (Tang et al., 2016). For a data point to belong in a cluster, it must have at least MinPts points within its radius. The result of OPTICS clustering is a reachability plot. On the x-axis, it sorts all data points in the order of processing based on their similarity. Values on the y-axis represent the distance of a point from the previous one. Several similar points form a valley representing a cluster, while spikes represent noise or outliers (Ankerst et al., 1999).

In our research, we first represented each command as a Python object with the following attributes: tool, arguments, application type, and timestamp, simplifying the record in Fig. 4. Then, we used the commands in two different feature matrices that later serve as the input for clustering. The first is a bag of words model (Pelánek et al., 2018). Each text document is represented by the set of words it contains and their counts. In our case, the "document" is a command history, and each tool is a "word". We disregarded the commands' arguments since we would obtain too many unique features and impair the performance of the clustering algorithm. While the bag of words model captures the used commands, it does not consider other information available in the logs. Therefore, we selected five custom features to capture other insights into how the trainees progressed:
• bash-count, the number of submitted Bash commands. A small number may suggest that a trainee did not progress far in the training. A high number may indicate a trial-and-error approach.
• msf-count, the number of Metasploit commands a trainee used. Metasploit may be new for some trainees, and a high number of executed commands may indicate difficulties with this part of the training.
• avg-gap, the average delay between two commands. Large gaps between commands may suggest the trainee did not understand how to use a tool and possibly looked for the information online. Small delays may indicate brute-force guessing.
• opt-changes, the number of times a trainee used the same tool twice in a row but changed the options or arguments. A high count may show the trainee's unfamiliarity with the tool or inability to use it.
• help-count, the number of times a trainee displayed the help information or manual page for any tool. It may also indicate the trainee's unfamiliarity with the tool.
All features were standardized, namely scaled by their maximum absolute value (scikit-learn developers, 2021). We also checked the Pearson correlation between features, as a high value may make them redundant. While there was a correlation of 0.85 between bash-count and opt-changes, we preserved both because they capture different properties. All other feature pairs were correlated less (the absolute values ranged from 0.20 to 0.66). We chose the OPTICS algorithm to cluster our data. For calculating the distance between data points, we selected cosine similarity. This measure performs well on high-dimensional data and is often used to compute text similarity (Shirkhorshidi et al., 2015). For example, the command nmap -sn -PS22 10.1.26.9 has a similarity of 0.6 with the command nmap --script=vuln 10.1.26.9 and approx. 0.32 with the command nmap --help. During the setup, OPTICS takes only one parameter, MinPts: the minimum number of points required for cluster formation. Theory suggests setting it to ln(n), where n is the number of points in the dataset (Birant and Kut, 2007). For our dataset, the recommended value is close to ln(113) ≈ 5, which we selected.
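A minimal sketch of this clustering setup, assuming scikit-learn, is shown below. The feature matrix here is randomly generated so that the snippet runs standalone; in the actual analysis, the rows would be the 113 trainees and the columns the bag of words tool counts plus the five custom features.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical feature matrix: one row per trainee (113 in the studied dataset).
# Random counts stand in for the bag of words columns and the custom features
# (bash-count, msf-count, avg-gap, opt-changes, help-count).
rng = np.random.default_rng(0)
X = rng.poisson(lam=3.0, size=(113, 30)).astype(float)

# Scale each feature by its maximum absolute value, as described above.
X_scaled = MaxAbsScaler().fit_transform(X)

# min_samples corresponds to MinPts = ln(113) ≈ 5; cosine distance is used.
clustering = OPTICS(min_samples=5, metric="cosine").fit(X_scaled)

print(clustering.labels_)        # cluster label per trainee (-1 marks noise/outliers)
print(clustering.reachability_)  # distances for drawing the reachability plot
```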
This section answers the two research questions (RQ) about the insights gathered from pattern mining and clustering. We visualize and interpret the findings from specific training sessions and subsequently compare the two approaches.

We now describe and discuss the results revealed by ARM and SPM. The command transaction database revealed 51 association rules for Training A and 50 for Training B. Table 2 presents the selected rules marked as interesting by measures such as lift. The first row shows that in Training A, 64% of commands executed in Metasploit had small gaps (delay times). This can mean that using Metasploit involved a rapid sequence of simple commands, or that the trainees experimented with a trial-and-error approach. The high support of the rule (23%) can also indicate the overuse of Metasploit because it was needed only for one task in this training. Generally, tools without arguments were associated with small gaps and often with Bash terminal commands. This most likely implies that tools without arguments are easier and faster to use. On the other hand, if a tool had medium or large gaps, it was used in the Bash terminal as well. This is because Bash offers many tools with various difficulty levels, some of which offer a multitude of options.

The tool transaction database provides further insight into the tool usage. Tools such as cd, ls, and cat, as well as Metasploit commands (use, set, show), were associated with small gaps. However, nmap was associated with large gaps in 72% of cases. This can indicate its difficulty of use or the long duration of the scan, which depends on the used arguments, as previously observed by Weiss et al. (2016).

The command sequence database in Training A revealed that trainees performed the Metasploit exploitation in various ways. Some steps were optional or performed in an arbitrary order. When multiple approaches to a solution are possible, instructors can use this insight to show different examples in class, assess all the correct sequences as passed, or even discover novel solutions. Alternatively, when unsuitable subsequences are found, the trainees can be notified, corrected, or even penalized. In Training B, SPM showed that most trainees established an SSH connection only on the second or third try. When students learn error-prone actions, instructors should leave room for trial and error and not penalize the students for repeated tries. On the other hand, about a third of the trainees excessively used the ls tool (as much as 17 times within a single sequence, interleaved by other tools). Instructors should discourage unproductive behavior and maybe offer hints to students when such sequences are observed.

The patterns from the tool sequence database show that in Training A, the participants usually progressed as the instructors expected. They started with an nmap scan and proceeded with the Metasploit exploitation. This is visualized in Fig. 5 using a Sankey diagram. Nodes represent the items of the discovered patterns. Edges between the nodes represent subsequences of the patterns. The thicker the edge, the higher the support of the pattern in which the subsequence occurs. The canonical solution featured these steps in the following order:
• nmap: scan the target IP address to discover available services;
• search: find Metasploit exploits suitable for the discovered service based on the provided keyword;
• use: select the correct exploit;
• show options: display parameters of the exploit that need to be set;
• set