key: cord-0045830-185hde7j authors: Nowak, Jakub; Holotyak, Taras; Korytkowski, Marcin; Scherer, Rafał; Voloshynovskiy, Slava title: Fingerprinting of URL Logs: Continuous User Authentication from Behavioural Patterns date: 2020-05-23 journal: Computational Science - ICCS 2020 DOI: 10.1007/978-3-030-50423-6_14 sha: fb0d670023e8187a878bc950e214f39e07dc60cc doc_id: 45830 cord_uid: 185hde7j Security of computer systems is now a critical and evolving issue. Current trends try to use behavioural biometrics for continuous authorization. Our work is intended to strengthen network user authentication by software interaction analysis. In our research, we use HTTP request (URL) logs that network administrators collect. We use a set of fully convolutional autoencoders and one authentication (one-class) convolutional neural network. The proposed method copes with extensive data from many users and allows adding new users in the future. Moreover, the system works in real time, and the proposed deep learning framework can use other features related to user behaviour and software interaction. For the past twenty years, the Internet and its utilisation have grown at an explosive rate. Moreover, for several years computer network users have been using various devices, not only personal computers. We also have to manage many appliances that are constantly online, as well as small Internet of Things devices. Efficient computer network intrusion detection and user profiling are essential for providing computer system security. Along with the proliferation of online devices, we witness more sophisticated security threats. There are many ways to harm networks, starting with weak passwords. Malicious software can be illicitly installed on devices inside the network to cause harm, steal information or perform large-scale tasks. Another source of weakness can be Bring Your Own Device schemes, where devices can be infected outside the infrastructure.
Finally, social engineering can be used to gain access to corporate resources and data. Each network user leaves traces; some are generated directly by the user, e.g. on social networks, while others are closely related to computer network mechanisms. Thanks to network traffic-filtering devices, network administrators nowadays have an enormous amount of network traffic data at their disposal. Authorising users based on their behaviour can be done in many ways, depending on the available data and methods. Identification can be based on facial features [15, 17] or on spoken instructions [2]. In [3, 11], data from smartphone sensors were used to analyse user behaviour. Similarly, the way the user unlocks the smartphone [4] can be explored; based on the collected data, the authors show the uniqueness of phone usage. This is related to certain user preferences and habits, as well as to the physical conditions of individual users, i.e. the way the phone is held. Another option is to authenticate the user with a signature [16]; in this solution, the signature is analysed not only as an image but also through the dynamics of its creation, captured with a haptic sensor. In our solution, we test whether the logged-in user has access to a given resource and does not impersonate someone else after breaking initial security measures based on, e.g., a password. Our research can be used in new-generation firewall devices working in layer 7 of the OSI model; in our case, however, the security rules are based on the analysis of the pages visited. Our method provides continuous authentication based on software interaction patterns. Recent years have brought learned semantic hashes to information retrieval. Semantic hashing [13] aims at generating compact vectors whose values reflect the semantic content of the objects.
Thus, to retrieve similar objects, we can search for similar hashes, which is much faster and requires much less memory than operating directly on the objects. The term was coined in [13]. The authors were the first to use a multilayer neural network to generate hashes and showed that semantic hashing obtains much better results than the earlier Latent Semantic Analysis (LSA) [7]. A similar method for generating hashes is the HashGAN network [5]; this solution is based on generative adversarial networks [8]. In the presented solution, we use autoencoders to create compact semantic hashes for the behaviour of computer network users from their URL request sequences. Our approach is inspired by the aforementioned studies that use hashes to analyse data, especially in NLP. After training the autoencoders, we use the encoder parts to generate hashes and feed them to the input of a one-class convolutional network that performs the final user authentication (Architecture 2). A schematic diagram of the system located in the computer network infrastructure is presented in Fig. 1. We also propose two smaller systems (Architectures 1 and 3) with worse accuracy. Through this research, we highlight the following features and contributions of the proposed system.
- We present three different approaches to URL-based, software interaction behavioural continuous authentication of computer network users. Until now, network traffic has usually been analysed with hand-crafted statistics.
- Our work provides new insights, showing that a system of autoencoders and a convolutional neural network can be trained efficiently for one-class authentication for nearly any number of users.
- The method can use nearly any kind of data as features.
- The proposed system is fast and can be used in real time in various IT scenarios.
The remainder of the paper is organised as follows. In Sect. 2, we discuss the problem of behavioural authentication of users in computer networks.
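The retrieval idea behind semantic hashing can be illustrated with a minimal sketch (toy data; the hand-written hashes below stand in for codes produced by a trained encoder): similar objects receive nearby binary codes, so retrieval reduces to comparing short hashes by Hamming distance instead of comparing full objects.

```python
import numpy as np

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits between two binary hash vectors."""
    return int(np.sum(a != b))

def retrieve_similar(query_hash, database_hashes, top_k=2):
    """Return indices of the top_k database hashes closest to the query."""
    distances = [hamming_distance(query_hash, h) for h in database_hashes]
    return sorted(range(len(distances)), key=lambda i: distances[i])[:top_k]

# Toy 8-bit hashes; in the paper these would come from the encoder part
# of a trained autoencoder.
db = [np.array([0, 0, 1, 1, 0, 1, 0, 1]),
      np.array([1, 1, 0, 0, 1, 0, 1, 0]),
      np.array([0, 0, 1, 1, 0, 1, 1, 1])]
query = np.array([0, 0, 1, 1, 0, 1, 0, 0])
print(retrieve_similar(query, db))  # nearest hashes first: [0, 2]
```

Comparing fixed-length binary codes in this way is what makes hash-based retrieval fast regardless of how large the original objects are.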
The proposed data representation and three Architectures are presented in Sect. 3. Experiments on real-world data from a large local municipal network, showing the accuracy and a comparison of the three presented Architectures, are described in Sect. 4. Finally, conclusions and discussion are presented in Sect. 5. The aim is to create an additional security layer to verify users in IT systems using data collected by the computer network administrative infrastructure. The additional authorisation is carried out without the user's knowledge. The proposed system constantly monitors HTTP request patterns for every computer in the network. The requests come from browsing websites or from applications sending queries with URL addresses. In other words, the method provides software interaction-based behavioural biometrics. It should be remembered here that the addresses stored in the firewall logs come both from pages opened in web browsers and from programs running in the background, such as anti-virus software or operating system updates. This article is based on data collected from a LAN infrastructure used by residents of four districts in Poland, as well as by network users who are employees of the local government offices and their organisational units, e.g. schools, hospitals, etc. Internet access in the analysed network is provided by two Cisco ASR edge routers that route packets using RIP version 2. The data for neural network training were collected between June 2017 and January 2018, and the test data in February 2018. The size of the raw logs was approximately 460 GB, and 0.9 GB after preprocessing (selecting time, date, user ID and URL). Based on the previously collected data, we examine whether an anomaly occurred, which would indicate a possible attack or use of the account by another user. To solve the problem, we use autoencoders and convolutional networks with various methods of data representation. We divided the task into several stages.
The first is to create a session based on the URL logs registered in the database. In our case, a session means a set of consecutive, non-repeating URLs for a given user. Each URL address was truncated to 45 characters. This size comes from the average length of the addresses and from observation of their construction. We have assumed that the most important information is at the beginning of the address. We primarily care about the domain of the visited website, the protocol that was used, and the basic parameters of the GET method. When creating a session, it sometimes happened that a query consisting of several URLs was sent at the same time. In such cases, certain sets of addresses were interleaved in a very specific way: between two identical addresses, a different one was inserted. To remove duplicate addresses, additional sorting by address name was used. An example of a set requested at the same time is "address1 address2 address1 address3 address1"; we would rather have the set "address1 address2 address3". This is an exceptional case; however, omitting this step increases the error of the authenticating CNN. A session to be analysed consists of a minimum of 20 different URLs, each with a maximum length of 45 characters drawn from the dictionary. Another limitation was the maximum size of the session to be analysed: we used up to 200 different URLs for fast neural network operation. The interval between the recorded addresses in one session cannot be longer than 30 min; if this time is exceeded, successive addresses form a new session. We examined three neural network configurations: Architecture 1 with a convolutional network (CNN) with two-dimensional filters used for text classification, Architecture 2 consisting of a one-class CNN with a unique autoencoder for every user, and Architecture 3 with a one-class CNN and one autoencoder for all the users. A URL is an address that allows locating a website on the Internet.
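The session-building rules above (45-character truncation, de-duplication, 20–200 distinct URLs, 30-minute gap) can be sketched as follows. This is a minimal illustration under stated assumptions; function and variable names are our own, and the paper's actual preprocessing pipeline may differ in detail.

```python
from datetime import datetime, timedelta

MAX_URL_LEN = 45              # URLs truncated to 45 characters
MIN_SESSION = 20              # a session needs at least 20 distinct URLs
MAX_SESSION = 200             # and is capped at 200 distinct URLs
GAP = timedelta(minutes=30)   # a longer gap between requests starts a new session

def build_sessions(log):
    """log: list of (timestamp, url) pairs for one user, sorted by time.
    Returns sessions as lists of truncated, de-duplicated URLs."""
    sessions, current, last_ts = [], [], None
    for ts, url in log:
        if last_ts is not None and ts - last_ts > GAP:
            sessions.append(current)
            current = []
        url = url[:MAX_URL_LEN]
        if url not in current and len(current) < MAX_SESSION:
            current.append(url)
        last_ts = ts
    if current:
        sessions.append(current)
    # keep only sessions long enough to analyse
    return [s for s in sessions if len(s) >= MIN_SESSION]
```

A burst of fewer than 20 distinct URLs, or one separated from the previous request by more than 30 minutes, thus never reaches the neural networks.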
The user encounters it mainly when using a web browser; however, URLs can also be requested by applications running in the background, such as antiviruses, system updates, etc. Each user's computer runs different applications at different times, which makes users even easier to distinguish. Text is very often represented with dictionaries. The construction of the URL has been repeatedly addressed in various articles [1, 20]. The majority of the previous works created hand-crafted features based on URL statistics. URLs do not consist of regular words; thus, using word dictionaries is not viable. In our experiment, we encode the entire address without dividing it into its subsequent parts. The condition is that the characters constituting the URL should belong to a previously defined set of 64 unique characters. We were inspired here by Zhang et al. [19] to present text data in the form of a one-hot vector at the character level. The dictionary consisted of the following characters: abcdefghijklmnopqrstuvwxyz 0123456789 -;.!?:/\| @#$%^& *~'+=<>()[] If a character was not in the above alphabet, it was removed. At the input of the neural network, in addition to the session of addresses used for classification, we also provide user identification data. In our case, we concatenate the user ID data with the URL input session. Therefore, in one input column we have two values equal to 1, and the rest of the rows of that column are filled with zeros. The first one is placed at a position in the range <1, 64>, which defines the letter from the dictionary, and the second one at a position in the range <65, 119>, denoting the user ID, as each user has a unique number. The input size to the CNN in Architecture 1 was 119 × 4096, where 4096 is the maximum length of the session we can provide at the network input. The coding scheme is presented in Fig. 2. Fig. 2. URL text coding scheme for convolutional networks in Architectures 1-3.
The upper part is one-hot character-level text encoding, and the lower part is one-hot user ID encoding. Our first attempt was to use a convolutional neural network with two-dimensional filtering, similar to [10]; the network architecture is presented in Fig. 3. This method proved to be a weak solution to the problem because a single convolutional neural network had to cope with a highly complex task and demonstrated a significant classification error. The accuracy of the anomaly recognition barely exceeded 63%, which is only slightly better than a random response and not viable in real-world computer network infrastructure. Architecture 2 was inspired by the one-class neural network [6]; here, each user (class) has a different autoencoder. The task of the network is to detect an anomaly in a given class. In our case, we add the user ID to "ask" the network whether given network traffic belongs to this particular user. Initially, we tried to use modified convolutional networks of the U-Net [12] structure without connections between feature maps of the same size (skip connections), which turned out to have an unacceptable training error. In the decoder part of the autoencoder, we implemented a one-dimensional sub-pixel convolutional layer inspired by [14]. It changes the size of the input to the convolutional layer by increasing the width of the channels at the expense of their number. Training data for the autoencoder are created like a sentence in NLP, consisting of URLs instead of words. We do not use a separator between URLs; the addresses follow one another like words in a sentence. The structure of an autoencoder for text is different from the structure of an autoencoder for images; here we were partly inspired by [18]. In our case, the latent space (bottleneck) layer of the autoencoder is 128 × 64. In the adopted architecture with pooling, each added layer reduces the size of the smallest, latent space layer of the autoencoder.
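The 119 × 4096 one-hot input described above can be sketched as follows. This is an illustrative implementation under stated assumptions: the alphabet string below approximates the paper's 64-character set (the exact character order is not specified), and the function name is our own.

```python
import numpy as np

# Approximation of the paper's character set: 26 letters, space, 10 digits,
# and punctuation. The exact 64-character dictionary is defined in the text.
ALPHABET = "abcdefghijklmnopqrstuvwxyz 0123456789-;.!?:/\\|@#$%^&*~'+=<>()[]"
CHAR_ROWS, USER_ROWS, MAX_LEN = 64, 55, 4096

def encode_session(session_text: str, user_id: int) -> np.ndarray:
    """One-hot encode a URL session as a 119 x 4096 matrix: rows 0-63
    encode the character, rows 64-118 encode the user ID, so each used
    column contains exactly two ones."""
    x = np.zeros((CHAR_ROWS + USER_ROWS, MAX_LEN), dtype=np.float32)
    for col, ch in enumerate(session_text[:MAX_LEN]):
        idx = ALPHABET.find(ch)
        if idx >= 0:                       # out-of-alphabet characters are removed
            x[idx, col] = 1.0
        x[CHAR_ROWS + user_id, col] = 1.0  # user ID repeated in every used column
    return x
```

Encoding the user ID directly into the input is what lets a single network answer the one-class question "does this session belong to this user?" rather than performing 55-way classification.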
The size of the latent space is a trade-off, determined experimentally, between the accuracy and the input size to the one-class CNN. The autoencoder structure used in the article is presented in Fig. 4a, together with its detailed meta-parameters. Our solution also utilizes a discriminator in the form of a convolutional network. Our task was to check whether the recorded session belongs to the user and whether a given set of URLs could have been generated by a specific user. It was, therefore, necessary to create a suitable discriminator for the autoencoder. The idea of the discriminator is similar to the GAN network [8]; we only care about assessing the mapping of data in the autoencoder and whether the given set could have been created by a specific user. Compared with Architecture 1, the user identifier has been moved from the system input to the discriminator input, i.e. it is appended to the last coding (latent space) layer of the autoencoder; the user coding method is identical to that in Architecture 1. The autoencoder, in this case, is unique for each user (unlike in Architecture 3). The discriminator input in Architectures 2 and 3 was 183 × 64, where 183 is made up of 128 (the column size of the hash from the autoencoder latent space) and 55 (the user ID, added in the same way as in Architecture 1). The value 64 comes from the number of feature maps of the autoencoder. The discriminator uses the same detailed meta-parameters in Architectures 2 and 3. In Architecture 3, we combined the two previous frameworks to create something in between in terms of size and complexity. Creating separate autoencoders for each user is somewhat problematic logistically, as it is easy to make a mistake when processing sessions for a user. Here we use one autoencoder as a uniform way of representing URLs for all users. The size of the autoencoder is the same as in Architecture 2.
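Assembling the 183 × 64 discriminator input can be sketched as follows. This is a minimal illustration under stated assumptions: the paper does not spell out exactly how the user-ID one-hot row is spread across the 64 feature maps, so setting the whole row to ones (mirroring the per-column encoding of Architecture 1) is one plausible choice, and the function name is our own.

```python
import numpy as np

LATENT_ROWS, LATENT_MAPS, NUM_USERS = 128, 64, 55

def discriminator_input(latent: np.ndarray, user_id: int) -> np.ndarray:
    """Stack the 128 x 64 latent hash from the autoencoder with a 55 x 64
    one-hot user-ID block, giving the 183 x 64 discriminator input.
    The row for user_id is set across all 64 feature maps (an assumption;
    the paper leaves this detail unspecified)."""
    assert latent.shape == (LATENT_ROWS, LATENT_MAPS)
    user_block = np.zeros((NUM_USERS, LATENT_MAPS), dtype=latent.dtype)
    user_block[user_id, :] = 1.0
    return np.vstack([latent, user_block])
```

Appending the ID at the latent level, rather than at the raw input as in Architecture 1, lets one shared discriminator judge hashes produced by all the per-user autoencoders.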
This solution improved on the results of the CNN from Architecture 1; however, it was worse than Architecture 2 with its per-user autoencoders. We performed experiments on a database with logs of visited URLs for 55 users. In our database, the most active user had 6,137 training sessions, and the least active user had 440 URL sessions, about 14 times fewer (details are presented in Table 3). To train Architectures 1-3, we had to generate illegitimate user sessions. We decided not to create synthetic user sessions, because this is a challenging task and could generate data outside the existing distribution. Instead, the data that the discriminator has to evaluate negatively were created from existing sessions: we present them to the network as if they belonged to another user. These sessions are randomly selected from the entire dataset. The cross-entropy with softmax loss function from the CNTK package was used to train the CNNs (discriminators). The learning rate for the CNNs was set as follows: 0.0008 for 5 epochs, 0.0002 for the next 10, 0.0001 for 10, 0.00005 for 10, and 0.00001 from then on, up to a maximum of 300 epochs. We used the binary cross-entropy loss function for training all the autoencoders. The best universal results for each user were obtained with a learning rate of 0.0001 for the first two epochs and then 0.00001 for 200 epochs. We trained all the architectures with stochastic gradient descent (SGD) with momentum set to 0.9 for both the autoencoders and the CNNs. To assess the accuracy of the autoencoders (Table 2), the Sørensen similarity coefficient QS = 2C/(A + B) was used, where A and B are the sizes of the compared elements, in our case the output and the input of the autoencoder, and C is the number of elements common to both. After each convolutional layer, we used Batch Normalization [9] with the ReLU activation function.
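The Sørensen coefficient used to assess autoencoder reconstruction can be computed as in the minimal sketch below, here over binary vectors; in the paper, A and B are the autoencoder input and output, and C counts the elements they share.

```python
import numpy as np

def sorensen_qs(a: np.ndarray, b: np.ndarray) -> float:
    """QS = 2C / (A + B), where A and B are the numbers of set elements
    in each vector and C is the number of positions set in both."""
    a, b = a.astype(bool), b.astype(bool)
    c = np.sum(a & b)
    denom = a.sum() + b.sum()
    return 2.0 * c / denom if denom else 1.0

x = np.array([1, 1, 0, 1, 0])   # e.g. autoencoder input
y = np.array([1, 1, 1, 1, 0])   # e.g. autoencoder output
print(sorensen_qs(x, y))        # 2*3 / (3+4) = 6/7 ~ 0.857
```

A coefficient of 1.0 means a perfect reconstruction; values close to 0 mean the autoencoder has failed to reproduce its input.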
The results are summarized in Table 1. The best solution turned out to be Architecture 2, using dedicated autoencoders and one discriminator. The detailed results for each user are presented in Table 3. The number of training sessions for a given user does not affect accuracy. During the implementation of the neural networks, the only limitation turned out to be the available GPU memory. We used four Nvidia GTX 1080 Ti graphics cards with 11 GB of memory each. When training the autoencoders, we could divide the tasks among the four graphics cards without any special multi-GPU techniques, as the autoencoder networks can be trained independently of each other. To speed up the training of the discriminator, we reuse data from autoencoder training instead of computing the encoder output again; without this technique, only 40 users could be trained on the aforementioned equipment. To train more users, a machine with more GPU memory is required. This limitation, however, applies only to training; if we had a machine with more memory only for training, it would be possible to run the trained system on the equipment at our disposal. Another indicator is the number of URL sessions processed by the system in a given time. Using the above-mentioned GPUs, we are able to process a set (mini-batch) of 200 sessions in 36.8 ms, which shows that the system can be used in real-time scenarios. This is the number of sessions that can be loaded at once onto a GPU with 11 GB of memory. Our system, based on autoencoders and a one-class CNN, is a new approach to system security and anomaly detection in user behaviour. It provides continuous authentication of computer network users by software interaction analysis. We used real text data from the network traffic instead of hand-crafted traffic statistics, as in previous approaches.
Moreover, the proposed framework is a universal anomaly detection system, applied in this paper to user authentication. We proposed three architectures that differ in size and complexity. The best architecture presented in the paper allows adding and removing any user without having to retrain the whole system. Thanks to this, we can save both time and computational resources. The presented solution can be used in other behavioural security solutions to create user profiles utilizing other available data. Future research would involve alternative methods of creating autoencoders to improve accuracy. In the presented article, we used the same way of training the autoencoder for each user. To obtain better results, it would be beneficial to create a dedicated autoencoder architecture for each user, but this would involve a change in the implementation of the discriminator.

References
[1] Lexical feature based phishing URL detection using online learning
[2] Voice biometrics: deep learning-based voiceprint authentication system
[3] Hold and sign: a novel behavioral biometrics for smartphone user authentication
[4] AnswerAuth: a bimodal behavioral biometric-based user authentication scheme for smartphones
[5] HashGAN: deep learning to hash with pair conditional Wasserstein GAN
[6] Anomaly detection using one-class neural networks
[7] Indexing by latent semantic analysis
[8] Generative adversarial nets
[9] Batch normalization: accelerating deep network training by reducing internal covariate shift
[10] An empirical study on network anomaly detection using convolutional neural networks
[11] A survey on behavioral biometric authentication on smartphones
[12] U-Net: convolutional networks for biomedical image segmentation
[13] Semantic hashing (Special Section on Graphical Models and Information Retrieval)
[14] Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network
[15] Deep learning face representation by joint identification-verification
[16] Secure behavioral biometric authentication with leap motion
[17] Sparse discriminative multi-manifold embedding for one-sample face identification
[18] Byte-level recursive convolutional auto-encoder for text
[19] Character-level convolutional networks for text classification
[20] A novel lightweight URL phishing detection system using SVM and similarity index

Acknowledgement. The project was financed under the programme of the Polish Minister of Science and Higher Education under the name "Regional Initiative of Excellence" in the years 2019-2022, project number 020/RID/2018/19, amount of financing: 12,000,000.00 PLN.