International Journal of Advanced Network, Monitoring and Controls, Volume 05, No.04, 2020
DOI: 10.21307/ijanmc-2020-036

Review of Anomaly Detection Based on Log Analysis

Wu Xudong
Laboratory of Wireless Network and Intelligent System
Xi'an Technological University
Xi'an, 710021, China
E-mail: wuxudong_wxd@163.com

Abstract—The development of the Internet and the emergence of large-scale systems have driven rapid social development and brought great convenience to people. With them, however, come network security problems: privacy theft, malicious attacks and other illegal acts persist, and a well-engineered software system logs the key operations of the software. Log analysis has therefore become an important means of anomaly detection. Based on a survey of the literature on log-based anomaly detection, this paper describes the state of research from the perspectives of template matching, automatic rule generation and outlier analysis, and analyzes the challenges that anomaly detection based on log analysis faces.

Keywords—Log Analysis; Distributed; Big Data; Anomaly Detection

I. INTRODUCTION

With the development of the Internet, big data and artificial intelligence have penetrated people's lives, quietly changing how people live, eat and travel, and making daily life faster, more efficient and easier. Research in many areas of computing is moving toward bionics, including human-like big-data processing, human-like computer vision and image processing, and human-like voice input. These studies give the computer, within a domain, something like "clairvoyant eyes and wind-following ears": it can perceive far beyond human senses, and it can also store and process the large volumes of heterogeneous data obtained from many sources, forming an invisible "superhuman" individual.

In the past 20 years, with the rapid development of the Internet in China, people's lifestyles have changed enormously and the number of Chinese Internet users keeps growing. According to CNNIC's 44th "Statistical Report on Internet Development in China", as of June 2019 the number of Internet users in China reached 854 million, an increase of 25.98 million over the end of 2018, and the Internet penetration rate reached 61.2%, up 1.6 percentage points from the end of 2018. The proportions of users going online with desktop computers, laptop computers and tablet computers were 46.2%, 36.1% and 28.3% respectively. These figures reflect not only the continuous increase in the number of netizens but also, indirectly, the rapid and continuous growth of log data.

A log records, at time points the developer considers worthy of attention, the changes of state or events that matter at those points. It is the most important source of information for understanding the operating status of a system and diagnosing system problems. Traditionally, system maintainers use tools such as grep and awk to filter keywords such as "error" or "exception" in the logs to find problems in system operation. When keyword filtering cannot meet the demand, more experienced personnel write scripts that impose more complex filtering rules. The cost of this approach is very high: writing effective scripts requires a deep understanding of the target system, and scripts written for a specific target system cannot be applied to other systems, so their generality is poor.
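To make this traditional workflow concrete, the following is a minimal Python sketch of keyword-based log filtering; the log path and keyword list are hypothetical placeholders, and the snippet only illustrates the style of analysis described above, not any particular tool.

```python
# Minimal keyword-based log filtering, similar in spirit to `grep -E "error|exception"`.
# The log path and keywords below are hypothetical placeholders.
import re

KEYWORDS = re.compile(r"error|exception|fatal", re.IGNORECASE)
LOG_PATH = "app.log"  # hypothetical log file

def filter_suspicious_lines(path):
    """Yield (line_number, line) pairs whose text matches any alarm keyword."""
    with open(path, "r", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            if KEYWORDS.search(line):
                yield lineno, line.rstrip()

if __name__ == "__main__":
    for lineno, line in filter_suspicious_lines(LOG_PATH):
        print(f"{lineno}: {line}")
```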
But even without considering the cost, this approach is no longer feasible for today's software systems. The ever-growing scale of log data, together with network security problems, confronts network managers with severe challenges: they must not only keep the network running stably and efficiently, but also provide network services that are as secure as possible.

The Internet brings convenience to our lives, but it also brings a series of network security problems, whose main characteristics are that they come in many forms, occur at all times, and cause huge losses. Human attacks, mis-operation and network equipment failures can all create security problems. Distributed denial-of-service attacks, botnet code, Trojan horses, ransomware, worms and other malware appear frequently; once malware is active it may cause irreparable economic losses to a company and pose a serious threat to people's privacy. A study [1][2] based on random sampling of large-scale systems showed that more than half of system failures were not logged at all. In such cases maintenance personnel have to find the cause of the problem manually, and because the code base is large, the time invested is considerable. High-quality logging in the code greatly improves detection efficiency after a program error occurs, and log records at key locations are an important means of ensuring that an abnormality can be quickly located and repaired. It is therefore necessary to add log statements at key positions in a program, and log analysis has become an important method of anomaly detection.
Fortunately, in recent years distributed computing technology has matured. Distributed computing platforms such as Hadoop, Spark, Flume and Storm are being adopted by more and more enterprises and are gradually being applied across industries for data storage and for online or offline analysis, which creates an opportunity for anomaly detection on log data.

This article first introduces the background of log analysis and anomaly detection, then summarizes the current research on log anomaly detection from the perspectives of template matching, rule generation and outlier analysis, classifying the surveyed papers, summarizing the types of log anomaly detection and the rules they use, and pointing out the difficulties encountered in the detection process. Finally, future work on anomaly detection based on log analysis is outlined.

II. RELATED TECHNOLOGIES AND CONCEPTS OF LOG ANOMALY DETECTION

A. Log analysis

A log is a record of events generated as network equipment, applications and systems run. Each line records the date, time, type, operator and a description of the related operation. Figure 1 shows part of an application log. In practice the log data generated by a system is very large and exhibits the 4V characteristics of big data: volume, variety, velocity and value. If log data are merely shelved they only occupy storage space; if they are used properly they can bring great value. Because of these 4V characteristics, manual analysis of the data is unrealistic, and log analysis tools must be used to exploit the value of log data.

Figure 1. Part of the application log

Several mainstream log analysis tools are in current use. Splunk is a full-text search engine for machine data and a hosted log-management tool; its main functions include log aggregation, search, meaning extraction, grouping, formatting and visualization of results. ELK consists of three parts: Elasticsearch, Logstash and Kibana. Elasticsearch is a near-real-time search platform; compared with MongoDB it is more fully featured and is very capable at full-text search, able to index, search, sort and filter documents. Logstash is a log collection tool that can gather messages from local files, the network and other sources and send them to Elasticsearch. Kibana provides a web-based visual interface with a dashboard.

B. Store log data

Because log data are huge in volume and semi-structured, traditional structured databases cannot meet their storage requirements. HDFS (the Hadoop Distributed File System) provides high-throughput data access, is well suited to large data sets, and can be deployed on low-cost machines, so it can satisfy the storage requirements of log data. In the experiments described in the surveyed work, the log data generated by the system are stored in HDFS. A properly configured HDFS automatically backs up the data: input files are divided into fixed-size blocks, typically 128 MB, each block is stored on different nodes, and each block generally has three replicas. The first replica is stored on the same node as the client, the second replica on a node in a different rack, and the third replica on another node in the same rack as the second replica.
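As an illustration of how collected logs might be archived into HDFS for later analysis, here is a small sketch assuming the third-party Python `hdfs` (HdfsCLI/WebHDFS) client; the namenode address, user name and paths are hypothetical placeholders, and real deployments may instead push logs through collectors such as Flume or Logstash.

```python
# A minimal sketch of archiving collected log files into HDFS for later analysis.
# Assumes the third-party `hdfs` (HdfsCLI) package and a WebHDFS endpoint;
# the namenode address, user name and paths below are hypothetical placeholders.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hadoop")

# Upload a local log file into a date-partitioned directory in HDFS.
client.makedirs("/logs/app/2020-11-01")
client.upload("/logs/app/2020-11-01/app.log", "app.log", overwrite=True)

# List what has been archived so far.
print(client.list("/logs/app/2020-11-01"))
```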
C. Log data preprocessing

Log data preprocessing has three goals:

• filtering "non-conforming" data and cleaning meaningless data;
• format conversion and regularization;
• filtering and separating the basic data needed by different subsequent statistical tasks.

For filtering "non-conforming" data and cleaning meaningless data: the log data generated by a system may be malformed or meaningless, so before format conversion and normalization a check is added to decide whether a record is well formed and meaningful; if it is not, the record is treated as useless and processing skips directly to the next record.

For format conversion and regularization: the characteristics of the data are analyzed first. The fields of each record are separated by spaces, so each record is split on spaces; fields that themselves contain spaces are handled specially with regular-expression matching. After splitting, each field is normalized, including time-format conversion, numeric type conversion, path completion, and so on.

For filtering and separating data with different needs: the required fields are extracted according to the needs of the subsequent detection algorithms.

D. Anomaly detection

Anomalies usually include outliers, fluctuation points and abnormal event sequences. In general, given an input time series X, an outlier is a timestamp-value pair (t, x_t) in which the observed value x_t deviates from the expected value of the series at time t. A fluctuation point is a time t at which the state or behavior of the series differs from the values before and after t. Given a set of time series X = {X_i}, an abnormal time series is a member of the set that is inconsistent with most of the other series in the set. The outliers are marked in the box in Figure 2. Peng Dong [3] divides anomaly detection methods into three categories: techniques based on statistical models, techniques based on proximity, and techniques based on density.

Figure 2. Outlier feature

In data mining, anomaly detection identifies items, events or observations that do not match the expected pattern or the other items in the data set. Abnormal items often correspond to problems such as bank fraud, structural defects, medical problems or text errors. Anomalies are also referred to as outliers, novelties, noise, deviations or exceptions. In abuse and network-intrusion detection in particular, the objects of interest are often not rare objects but unexpected bursts of activity. This pattern does not follow the usual statistical definition of an outlier as a rare object, so many anomaly detection methods fail on such data unless appropriate aggregation is performed; by contrast, cluster analysis may be able to detect the micro-clusters formed by these patterns.
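The outlier and fluctuation-point definitions above can be illustrated with a minimal numpy sketch on a synthetic series; the window size and the 3-sigma threshold are illustrative assumptions, not values prescribed by the surveyed work.

```python
# A minimal numpy sketch of the outlier and fluctuation-point definitions above.
# The series is synthetic and the window size / 3-sigma threshold are illustrative
# choices, not values prescribed by the surveyed papers.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(10.0, 1.0, 200)
x[60] += 8.0            # inject an outlier
x[120:] += 5.0          # inject a level shift (fluctuation point at t = 120)

def outliers(series, window=20, k=3.0):
    """Return indices t where x_t deviates from the trailing mean by more than k sigma."""
    idx = []
    for t in range(window, len(series)):
        hist = series[t - window:t]
        mu, sigma = hist.mean(), hist.std()
        if sigma > 0 and abs(series[t] - mu) > k * sigma:
            idx.append(t)
    return idx

def fluctuation_points(series, window=20, k=3.0):
    """Return indices t where the mean after t differs sharply from the mean before t."""
    idx = []
    for t in range(window, len(series) - window):
        before = series[t - window:t]
        after = series[t:t + window]
        if abs(after.mean() - before.mean()) > k * before.std():
            idx.append(t)
    return idx

print("outliers:", outliers(x))
print("fluctuation points:", fluctuation_points(x))
```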
Depending on the supervision available, there are three types of anomaly detection method. Under the assumption that most instances in the data set are normal, unsupervised anomaly detection examines unlabeled test data and flags the instances that fit the rest of the data least well. Supervised anomaly detection requires a data set labeled "normal" and "abnormal" and involves training classifiers (the key difference from many other statistical classification problems being the inherent class imbalance of anomaly detection). Semi-supervised anomaly detection builds a model of normal behavior from a given normal training set and then evaluates how likely it is that a test instance was generated by the learned model.

III. RESEARCH STATUS OF LOG ANOMALY DETECTION TECHNOLOGY

Anomaly detection is the process of finding data patterns in the data that do not meet expectations [4]. Anomaly detection based on log data can essentially be regarded as a classification problem: distinguishing normal behavior from abnormal behavior in a large volume of log behavior data and determining the specific attack method from the abnormal behavior [5]. While a server is running, the log records the behavior of users throughout the access process, and information about abnormal users can be found by processing the information in the log. Analyzing logs has therefore become one of the most effective ways to detect abnormal user behavior [6][7][8]. With the rapid development of big data, log-based anomaly detection methods can be divided into three categories: model-based techniques, proximity-based techniques and density-based techniques.

A. Model-based technology

Model-based techniques first build a model of the data; anomalies are the objects that the model cannot fit well. Since abnormal and normal objects can be regarded as two different classes, classification techniques can be used to build models of the two classes. The training set, however, is critical in classification, and because anomalies are relatively rare it is difficult to detect new kinds of anomaly that may appear [9].

Wang Zhiyuan et al. [10] used log templates to detect anomalies in 2018. The logs are cleaned first, and edit distance is then used to cluster the text into log templates. On top of the log templates, TF-IDF (term frequency-inverse document frequency) is used to form feature vectors; weak classifiers such as logistic regression, naive Bayes and support vector machines are trained to obtain score feature vectors, and a strong classifier is built from the score feature vectors with a random forest. Finally, mutual information is used to measure the correlation between the ground-truth templates and the clustered templates, precision and recall are used to evaluate the classification effect, and the various classifiers are compared.

Siwoon Son et al. [11] proposed a data storage and analysis architecture based on Apache Hive to process large volumes of Hadoop log data, and used moving averages and the 3-sigma rule to design and implement three anomaly detection methods: a basic method, a linear-weight method and an exponential-weight method. The basic method computes the moving average and standard deviation for anomaly detection, but it produces repeated detections; to address this, the two weighted variants are introduced. In the linear-weight method, weights are assigned in proportion to the position of a log item, while the exponential-weight method assigns weights exponentially on top of the basic method. Finally, the effectiveness of the proposed methods is evaluated in a Hadoop environment with one name node and four data nodes.
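The following sketch illustrates the general idea of moving-average plus 3-sigma detection with plain, linearly weighted and exponentially weighted windows; it is not the implementation of [11], and the window size, decay factor and synthetic metric are assumptions made for illustration.

```python
# A sketch of moving-average + 3-sigma anomaly detection over a numeric log metric
# (e.g., events per minute), with plain, linearly weighted and exponentially weighted
# windows. This only illustrates the general idea discussed above; it is not the
# implementation of [11], and the window size and weights are arbitrary choices.
import numpy as np

def weighted_3sigma_anomalies(series, window=30, scheme="plain", k=3.0):
    """Return indices whose value is more than k weighted std-devs from the weighted mean."""
    series = np.asarray(series, dtype=float)
    if scheme == "plain":
        w = np.ones(window)
    elif scheme == "linear":
        w = np.arange(1, window + 1, dtype=float)   # newer items weigh more
    elif scheme == "exponential":
        w = 0.9 ** np.arange(window - 1, -1, -1)    # weights decay into the past
    else:
        raise ValueError("unknown scheme")
    w = w / w.sum()

    anomalies = []
    for t in range(window, len(series)):
        hist = series[t - window:t]
        mu = np.dot(w, hist)
        sigma = np.sqrt(np.dot(w, (hist - mu) ** 2))
        if sigma > 0 and abs(series[t] - mu) > k * sigma:
            anomalies.append(t)
    return anomalies

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    metric = rng.poisson(50, 500).astype(float)
    metric[400] = 200.0   # simulated spike in the log rate
    for scheme in ("plain", "linear", "exponential"):
        print(scheme, weighted_3sigma_anomalies(metric, scheme=scheme))
```

Giving recent log items more weight lets the baseline track gradual drift more closely, which is the intuition behind the weighted variants described above.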
Fu et al. [12] proposed a technique for anomaly detection in unstructured system logs that does not require any application-specific knowledge, including a method for extracting log keys from free-text messages; in their experiments on the Hadoop platform the false positive rate was about 13%. Xu et al. [13] matched the log format against the source code to find the relevant variables, extracted features of the corresponding log variables with a bag-of-words model, reduced the dimensionality of these features with principal component analysis, detected abnormal log files according to the maximum separability given by principal component analysis, and finally used a decision tree to visualize the results. Fronza et al. [14] represented operation sequences with random indexing, characterizing each operation in a log by its context, and then used support vector machines to associate sequences with the fault or non-fault category in order to predict system failures. Peng et al. [15] applied text mining to categorize messages in log files into common situations, improved classification accuracy by taking the temporal characteristics of log messages into account, and used visualization tools to evaluate and verify the effectiveness of the temporal patterns for system management.

B. Technology based on proximity

Proximity-based techniques rely on proximity measures between objects, such as distance. Zhang Luqing [16] proposed a web attack data mining algorithm based on the anomaly degree of outliers, which first clusters HTTP requests and then builds a detection model that approximates a normal distribution. The algorithm takes the arithmetic mean of each numerical attribute and the most frequent value of each categorical attribute as the centroids of the numerical and categorical attributes, combines them into the centroid T of the data set, and uses the distance between an object p and the centroid T as the anomaly degree of p. Experiments confirmed that the algorithm achieves a high detection rate. Jakub Breier et al. [17] proposed a log file anomaly detection method that dynamically generates rules from patterns in sample files and can learn new types of attack while minimizing the need for human intervention; the implementation uses the Apache Hadoop framework for distributed storage and distributed data processing, so that parallel processing speeds up execution. Because incremental mining algorithms based on the local outlier factor require multiple scans of the data set, Zhang Zhongping et al. [18] proposed a stream-data outlier mining algorithm (SOMRNN) based on reverse k-nearest neighbors; updating the current window with a sliding-window model requires only one scan, which improves the efficiency of the algorithm. Grace et al. [19] used data mining methods to analyze web log files to obtain more information about users; they describe log file formats, types and content and give an overview of the web usage mining process.
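To illustrate the proximity-based idea of scoring records by their distance to a mixed-type centroid (numeric mean, categorical mode), here is a small sketch loosely inspired by the description of [16]; the field names and records are hypothetical and the distance function is a simplification, not the published algorithm.

```python
# A minimal sketch of proximity-based scoring in the spirit of the centroid-distance
# idea described above (loosely inspired by [16], not a reproduction of it).
# Records mix numeric and categorical fields; field names and data are hypothetical.
from collections import Counter

records = [
    {"bytes": 512, "status": "200", "method": "GET"},
    {"bytes": 498, "status": "200", "method": "GET"},
    {"bytes": 530, "status": "200", "method": "POST"},
    {"bytes": 90210, "status": "500", "method": "GET"},   # suspicious request
]

NUMERIC = ["bytes"]
CATEGORICAL = ["status", "method"]

def centroid(rows):
    """Numeric attributes -> arithmetic mean; categorical attributes -> most frequent value."""
    c = {}
    for f in NUMERIC:
        c[f] = sum(r[f] for r in rows) / len(rows)
    for f in CATEGORICAL:
        c[f] = Counter(r[f] for r in rows).most_common(1)[0][0]
    return c

def anomaly_degree(row, c, scale):
    """Distance to the centroid: scaled absolute difference for numeric fields,
    0/1 mismatch for categorical fields."""
    d = sum(abs(row[f] - c[f]) / scale[f] for f in NUMERIC)
    d += sum(0 if row[f] == c[f] else 1 for f in CATEGORICAL)
    return d

c = centroid(records)
scale = {f: max(abs(r[f] - c[f]) for r in records) or 1.0 for f in NUMERIC}
for r in records:
    print(r, "->", round(anomaly_degree(r, c, scale), 3))
```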
Liang Bao et al. [20] proposed a general method for mining console logs to detect system problems. They first give formal problem definitions, then extract the set of log statements from the source code and generate a reachability graph showing the reachability relations among log statements. Log files are then parsed into log messages by combining information about the log statements with information retrieval techniques, and the messages are grouped according to execution units. A detection algorithm based on a probabilistic suffix tree is proposed to organize and distinguish the significant statistical characteristics of the sequences. Experiments on a CloudStack test platform and a Hadoop production system showed that, compared with four existing anomaly detection algorithms, this algorithm can effectively detect abnormal operation. Because abnormal points are rare in practice, Liu et al. [21] proposed an isolation-based anomaly detection algorithm; the isolation trees it builds converge quickly but require sub-sampling to achieve high accuracy.

C. Density-based technology

Density-based techniques treat objects in low-density regions as abnormal points. The density-based local outlier factor algorithm (LOF) has high time complexity and is not suitable for outlier detection on large-scale or high-dimensional data sets. Wang Jinghua et al. [22] proposed a local outlier detection algorithm, NLOF. Li Shaobo et al. [23] proposed a density-based abnormal data detection algorithm, GSWCLOF, which introduces the notions of a sliding time window and a grid: within the sliding window the grid is used to subdivide the data, information entropy is used to prune and filter the data in all grid cells so as to eliminate most of the normal data, and the outlier factor is then used to make a final judgment on the remaining data. Wang Qian et al. [24] proposed a density-based detection algorithm that introduces the Local Outlier Factor (LOF) and judges whether a data point is abnormal from its LOF value; the algorithm is only suitable for static data, because once the data change the LOF values of all the data must be recalculated, so its adaptability is poor and it is unsuitable for dynamic data. Pukelsheim [25] assumed that the data follow a univariate Gaussian distribution and judged test samples lying more than two or three standard deviations away from the mean as abnormal.
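As a generic illustration of the density-based approach discussed in this subsection (and, for contrast, the isolation-based approach of [21]), the sketch below applies scikit-learn's LocalOutlierFactor and IsolationForest to synthetic two-dimensional data; the contamination rate and neighborhood size are assumptions, and this is ordinary library usage rather than an implementation of NLOF or GSWCLOF.

```python
# A minimal scikit-learn sketch of the LOF idea discussed above and, for contrast,
# isolation-based detection [21]. This is generic library usage on synthetic data,
# not an implementation of NLOF or GSWCLOF; the contamination rate is a guess.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
normal = rng.normal(0.0, 1.0, size=(300, 2))          # dense cluster of normal points
outliers = rng.uniform(-6.0, 6.0, size=(10, 2))       # sparse, low-density points
X = np.vstack([normal, outliers])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
lof_labels = lof.fit_predict(X)                        # -1 marks predicted outliers

iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
iso_labels = iso.fit_predict(X)

print("LOF flagged:", int((lof_labels == -1).sum()), "points")
print("Isolation Forest flagged:", int((iso_labels == -1).sum()), "points")
```

Points labeled -1 are those the respective model considers to lie in low-density or easily isolated regions.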
IV. CHALLENGES FACED BY LOG ANOMALY DETECTION TECHNOLOGY

Between the moment a system abnormality occurs and its successful detection there are several possible obstacles:

• the exception is not logged;
• the format of the exception log records is not standardized;
• the exception log cannot be delivered to the processing end in time;
• the abnormal log messages are lost in transit;
• the detection algorithm is not accurate enough.

If one or more of these conditions occurs, anomaly detection fails.

A. Real-time

The purpose of anomaly detection is to find anomalies and then handle them in an appropriate way. If the delay from logging, through anomaly detection and manual analysis, to the elimination of the anomaly is too long, the anomaly persists for too long and the losses become more serious. If real-time performance can be guaranteed, the efficiency of anomaly elimination is greatly improved.

B. Detection accuracy

Many factors affect the accuracy of anomaly detection, such as irregular log formats and inappropriate algorithms. These problems directly reduce the accuracy of anomaly detection, which also means that log anomaly detection cannot be completely separated from the intervention of technicians. Even when the same benchmark data set is used in the literature, most papers do not state the size or proportion of labeled data; even the sizes of the training and test sets and the evaluation metrics differ. Such differing combinations of measures make the research results impossible to compare with one another.

C. The versatility of detection algorithms

At present there are many anomaly detection algorithms at home and abroad, such as Isolation Forest, One-Class SVM, robust covariance, K-means, principal component analysis and the 3-sigma rule. These algorithms have their own advantages and disadvantages and are not suitable for every anomaly detection task. Because logs are unstructured and differ from system to system, a specific algorithm is needed for a specific log, or a specific algorithm must be adapted to achieve a higher detection rate. This "localization" of algorithms also requires specialized technical personnel, which increases the cost of detection.

D. Tag data

Log data are voluminous while abnormal records are very few. Labeling a small amount of abnormal data within a huge data set is very difficult, and there is little publicly available labeled data to serve as an experimental basis, so anomaly detection encounters great difficulties here.

V. RESEARCH DIRECTION OF ANOMALY DETECTION

Based on the current state of anomaly detection research and the problems above, the challenges and future research directions of anomaly detection can be summarized as follows.

Traffic data often have high feature dimensionality, and the Euclidean distance used in sampling methods cannot measure the spatial distribution of samples well. The data distribution assumptions of supervised and semi-supervised learning differ. Under unbalanced data, most existing semi-supervised methods simply apply traditional techniques to semi-supervised learning, so the traditional approaches to the imbalance problem are not necessarily suitable for semi-supervised learning and need further study. Although research on data imbalance has achieved good results in the network security field, there is very little research on the imbalance problem in semi-supervised learning; most semi-supervised methods applied to anomaly detection use ensemble learning to handle class imbalance. In the future, anomaly detection could be improved by incorporating the latest results on data imbalance under semi-supervision. At present, much network traffic feature selection and extraction is limited to one-dimensional features or simple combinations of multi-dimensional features, whereas traffic anomalies usually manifest themselves in multi-dimensional features.
How to effectively fuse multi-dimensional features, learn data-stream features from multiple perspectives, and use a small amount of labeled data in semi-supervised ensemble algorithms to combine results while reducing information loss is a challenging research topic. Semi-supervised dimensionality reduction is a feasible approach in anomaly detection, and finding more effective ways to handle high-dimensional sparse samples and continuous variables, while further improving the real-time performance of detection models, is of great significance.

Combining active learning with semi-supervised learning yields better results than either strategy alone: the combination can actively seek out effective supervision information, and with that information unlabeled samples can be used more effectively, improving both model accuracy and solving speed. Research on combining semi-supervised learning with active learning is nevertheless rare, leaving much room for improvement. Incremental semi-supervised anomaly detection is closer to real-world anomaly detection, since it makes full use of previously processed results during training, and deserves deeper study in network security; in the future, incremental algorithms from natural language processing could be introduced into specific anomaly detection tasks. Semi-supervised clustering introduces supervision information into traditional clustering algorithms to accomplish semi-supervised learning, and could be extended to algorithms such as density clustering and spectral clustering. In addition, some traffic data are high-dimensional and sparse, while most existing clustering algorithms are unsuited to high-dimensional sparse data, so further work is needed.

In general, semi-supervised learning can improve performance by exploiting unlabeled data, especially when labeled data are limited. In some cases, however, selecting unreliable unlabeled data may mislead the formation of classification boundaries and ultimately degrade semi-supervised learning performance, so how to use unlabeled data safely is a future research focus. Multiple semi-supervised anomaly detection methods and techniques can be combined to achieve more efficient detection on network data and more accurate predictions. Minimizing the additional load that semi-supervised anomaly detection places on the network is likewise a challenging research topic.

VI. CONCLUSION

Machine learning faces many challenges in abnormal traffic detection, the biggest of which is the lack of labeled data: in practice only a limited amount of labeled data is available, while most data are unlabeled. Moreover, although there is a large amount of normal access data, abnormal traffic samples are few and attack forms are varied, which makes the model difficult to learn and train. Semi-supervised learning, which can use both unlabeled and labeled data, is an effective way to alleviate this problem. For anomaly detection based on log analysis, research at home and abroad has made definite progress and produced a variety of results.
Methods based on template matching, automatic rule generation, outlier analysis and statistical techniques have all shown a certain effect, and they are of great significance for network security and intelligent operation and maintenance. Future research will continue to focus on real-time performance, so that abnormalities can be detected as quickly as possible; on improving detection accuracy and minimizing or even eliminating manual intervention; and on the versatility of algorithms, so that a single algorithm can adapt to log analysis in as many different environments as possible.

REFERENCES

[1] Yuan D, Park S, Huang P, Liu Y, Lee M M, Tang X, Zhou Y, Savage S. Be conservative: enhancing failure diagnosis with proactive logging. In: Proc. of the 10th Symp. on Operating Systems Design and Implementation (OSDI), 2012: 293-306.
[2] Yuan D, Park S, Zhou Y. Characterizing logging practices in open-source software. In: Proc. of the 2012 Int'l Conf. on Software Engineering, 2012: 102-112. doi:10.1109/ICSE.2012.6227202.
[3] Peng Dong. Intelligent Operation and Maintenance: Building a Large-Scale Distributed AIOps System from Zero. Electronic Industry Press, 2018. ISBN 978-7-121-34663-7, pp. 198-199.
[4] Chandola V, Banerjee A, Kumar V. Anomaly detection: a survey. ACM Computing Surveys, 2009, 41(3).
[5] Davis J J, Clark A J. Data preprocessing for anomaly based network intrusion detection: a review. Computers & Security, 2011, 30(6-7): 353-375.
[6] Lin Q, Zhang H, Lou J, Zhang Y, Chen X. Log clustering based problem identification for online service systems. In: 2016 IEEE/ACM 38th International Conference on Software Engineering Companion (ICSE-C), Austin, TX, 2016: 102-111.
[7] Pecchia A, Cotroneo D, Kalbarczyk Z, et al. Improving log-based field failure data analysis of multi-node computing systems. IEEE, 2011.
[8] Tambe R, Karabatis G, Janeja V P. Context aware discovery in web data through anomaly detection. International Journal of Web Engineering and Technology, 2015, 10(1): 3.
[9] Wang Xiaodong, Zhao Yining, Xiao Haili, Chi Xuebin, Wang Xiaoning. Detection method of abnormal log flow pattern in multi-node system. Journal of Software (online first): 1-15 [2019-12-24].
[10] Wang Zhiyuan, Ren Chongguang, Chen Rong, Qin Li. Anomaly detection technology based on log template. Intelligent Computers and Applications, 2018, 8(05): 17-20+24.
[11] Son S, Gil M S, Moon Y S. Anomaly detection for big log data using a Hadoop ecosystem. In: 2017 IEEE International Conference on Big Data and Smart Computing (BigComp), Jeju Island, South Korea, 2017: 377-380.
[12] Fu Q, Lou J G, Wang Y, Li J. Execution anomaly detection in distributed systems through unstructured log analysis. In: Proc. of the 2009 Ninth IEEE International Conference on Data Mining (ICDM '09), Washington, DC: IEEE Computer Society, 2009: 149-158. doi:10.1109/ICDM.2009.60.
[13] Xu W, et al. Large-scale system problems detection by mining console logs. In: Proc. of the ACM SIGOPS Symposium on Operating Systems Principles, Big Sky, MT, 2009.
[14] Fronza I, Sillitti A, Succi G, Terho M, Vlasenko J. Failure prediction based on log files using Random Indexing and Support Vector Machines. Journal of Systems and Software, 2013, 86(1): 2-11.
[15] Peng W, Li T, Ma S.
Mining log files for data-driven system management. ACM SIGKDD Explorations Newsletter, 2005, 7(1): 44-51.
[16] Zhang Luqing. Web attack data mining algorithm based on outlier anomaly. Ship Electronic Engineering, 2018, 38(09): 105-110.
[17] Breier J, Branišová J. A dynamic rule creation based anomaly detection method for identifying security breaches in log records. Wireless Personal Communications, 2015, 94(3): 1-15.
[18] Zhang Zhongping, Liang Yongxin. Algorithm for mining outliers in flow data based on anti-k nearest neighbors. Computer Engineering, 2009, 35(12): 11-13.
[19] Grace L, Maheswari V, Nagamalai D. Web log data analysis and mining. In: Meghanathan N, Kaushik B, Nagamalai D (eds), Advanced Computing, Communications in Computer and Information Science, Vol. 133. Berlin: Springer, 2011: 459-469.
[20] Liang Bao, Qian Li, Peiyao Lu, Jie Lu, Tongxiao Ruan, Ke Zhang. Execution anomaly detection in large-scale systems through console log analysis. The Journal of Systems & Software, 2018, 143: 172-186.
[21] Liu F T, Ting K M, Zhou Z H. Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data, 2012, 6(1): 1-39.
[22] Wang Jinghua, Zhao Xinxiang, Zhang Guoyan, Liu Jianyin. NLOF: a new density-based local outlier detection algorithm. Computer Science, 2013, 40(08): 181-185.
[23] Li Shaobo, Meng Wei, Wei Jinglei. Density-based abnormal data detection algorithm GSWCLOF. Computer Engineering and Applications, 2016, 52(19): 7-11.
[24] Wang Qian, Liu Shuzhi. Improvement of local outlier data mining method based on density. Application Research of Computers, 2014, 31(06): 1693-1696+1701.
[25] Pukelsheim F. The three sigma rule. The American Statistician, 1994, 48(2): 88-91.