title: On the Provenance Extraction Techniques from Large Scale Log Files: A Case Study for the Numerical Weather Prediction Models
authors: Tufek, Alper; Aktas, Mehmet S.
date: 2021-02-15
journal: Euro-Par 2020: Parallel Processing Workshops
DOI: 10.1007/978-3-030-71593-9_20

Day by day, severe meteorological events increasingly highlight the importance of fast and accurate weather forecasting. There are various Numerical Weather Prediction (NWP) models worldwide that are run on either a local or a global scale to predict future weather. A complete NWP model run, however, typically takes hours to finish, depending on the input parameters and the size of the forecast domain. Provenance information is of central importance for detecting unexpected events that may develop during model execution, and for taking the necessary action as early as possible. In addition, the need to share scientific data and results among researchers highlights the importance of data quality and reliability. In this study, we develop a framework for tracking the Weather Research and Forecasting (WRF) model and for generating, storing, and analyzing provenance data. We develop a machine-learning-based log parser to make the proposed system dynamic and adaptive, so that it can adapt to different data and rules. The proposed system enables easy management and understanding of numerical weather forecast workflows by providing provenance graphs. By analyzing these graphs, potential faulty situations that may occur during the execution of WRF can be traced back to their root causes. The proposed system has been evaluated and shown to perform well even under a high-frequency flow of provenance information.

The importance of fast and reliable weather forecasting in today's world continues to increase. Today, we almost always take weather conditions into account before deciding on a journey or any other kind of activity. Because of global warming, there is a significant increase in the number of extraordinary weather events. Weather events such as hurricanes, floods, and high winds can cause large-scale loss of property and life if the necessary measures are not taken. In this context, faster and more accurate weather prediction becomes ever more crucial. This makes it necessary for meteorologists, scientists, and researchers to work together, share the input/output data they use, and exchange the results they obtain. Various Numerical Weather Prediction (NWP) models are run each day in different meteorological organizations across the world to make weather forecasts. These models mathematically simulate the atmosphere and the oceans and calculate parameters such as temperature, pressure, and wind speed by processing data gathered primarily for meteorological purposes, such as radar/satellite data and observation data from weather observation stations. NWP models are usually run more than once every day at regular intervals. However, the data collected from the aforementioned scientific measurement devices are very diverse, both in format and in size. Therefore, the management of data quality, reusability, and reliability becomes more complex and difficult. In this respect, the need for systematic provenance is gaining importance, especially in scientific studies [1].
Provenance is defined by the W3C consortium in its PROV specification [2] as all of the entities, events, and persons that have some impact on the process of generating a data product, and it can be used to assess the quality and reliability of the data. Modifications to the data, the methods used in the production process, and metadata for reproducing the same data can all be included in the definition of provenance.

In this study, we use the Weather Research and Forecasting (WRF) model, an open-source NWP model widely used worldwide by meteorologists and researchers. Being open source and having large community support are among the advantages of the model. The WRF model is used for weather forecasts by meteorological organizations in many countries across the world, including Turkey. The WRF model, like other NWP models, takes input parameters such as the boundaries of the prediction domain and the resolution at which the predicted values are to be calculated. Once the model starts to run, it usually takes hours to produce its results, depending on the input parameters. Most of the time, it is not possible to intervene in the course of model execution. To evaluate the correctness of the model outputs after completion, it is therefore of great importance to track the processing steps that took place during the generation process. In this way, it can easily be determined whether an error occurred during the prediction phase and, if so, where its cause lies.

The main motivation for this study is to address the lack of provenance support in the WRF model software, a widely used numerical weather prediction model. The WRF model is composed of several executable programs, each of which generates particular log outputs. Other than that, there is no structured provenance generation or storage in any phase of a complete execution cycle. These raw log outputs are just free-form text lines containing various levels of information about the execution details. The contents of the log file that a specific WRF program produces can change, even from one version of the program to the next. The main contribution of this study is to address these motivating points and to provide machine learning-based methodologies for provenance collection from the WRF model software. We investigate WRF and analyze the log files generated in the course of its execution. We develop a machine learning-based parser, which utilizes classification algorithms and eliminates the need for a rule database as a prerequisite.

Log analysis is one of the commonly used methods for obtaining provenance information. However, the quality of the provenance information produced by this method depends both on the level of detail in the log files and on the percentage of provenance-bearing log lines that can be captured. In our previous work [3], we developed a rule-based log parser to extract provenance information from WRF logs. That approach was based on a rule database containing a list of special keywords that helped the log parser distinguish the lines containing provenance information; these keywords were predetermined manually. In this study, we propose a novel approach to provenance extraction from log files that is based on machine-learning techniques. In this approach, a classification model is constructed by training on various log files before deployment.
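To make the contrast with the earlier rule-based parser [3] concrete, the sketch below shows the kind of keyword rule such a parser relies on. The keyword list is a hypothetical placeholder introduced here for illustration, not the actual rule database used in [3].

```java
// Minimal sketch of a keyword-rule filter in the spirit of the rule-based parser [3].
// The keyword list below is a hypothetical placeholder, not the real rule database.
import java.util.Arrays;
import java.util.List;

public class KeywordRuleFilter {

    // Keywords that mark a log line as potentially provenance-bearing (illustrative only).
    private static final List<String> PROVENANCE_KEYWORDS =
            Arrays.asList("input file", "output file", "opened", "completed");

    /** Returns true if the line matches any keyword rule and should be parsed further. */
    public static boolean containsProvenance(String logLine) {
        String lower = logLine.toLowerCase();
        return PROVENANCE_KEYWORDS.stream().anyMatch(lower::contains);
    }
}
```

Every new log format or WRF version potentially requires such a keyword list to be revised by hand; removing this maintenance burden is precisely the aim of the machine learning-based parser proposed here.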
Here, the machine learning-based parser eliminates the need for a rule database at the expense of a small percentage of provenance loss.

The paper is organized as follows. Section 2 provides a literature review. Section 3 presents a brief overview of the Weather Research and Forecasting (WRF) model. Section 4 briefly introduces the PROV specification. Section 5 explains the proposed methodology for the machine learning-based provenance parser in detail. In Sect. 6, the implementation details of the prototype system are explained. In Sect. 7, the performance tests on the proposed framework are described, and the test results and evaluations are discussed. Finally, in Sect. 8, the results obtained in the study are summarized.

Both the storage and computing capacities of computer systems are increasing day by day, so computer systems are being used ever more frequently across scientific disciplines to solve problems that require complex calculations and/or the processing of large volumes of data. The scientific programs developed within the scope of these studies are generally written by scientists from the respective disciplines, so the developers' priority is to produce algorithmically correct solutions to scientific problems. For this reason, scientific programs generally do not have an integrated provenance infrastructure. In their work from 2006, Simmhan et al. propose a general-purpose framework that allows provenance information to be compiled from data-driven scientific workflows [4]. They try to define the requirements for systems that collect data and workflow provenance. They also develop a standalone tool, Karma [5], as a prototype for the collection, representation, and storage of provenance data. Karma later evolves into the PROV-compliant Komadu [6] framework, which is used in this study as the provenance storage backend. This provenance framework is tested on the Linked Environments for Atmospheric Discovery (LEAD) project by Droegemeier et al. [7]. LEAD is a meteorological research and training project, based on a Service-Oriented Architecture (SOA), designed to enable operations such as access, pre-processing, assimilation, management, analysis, data mining, and visualization to be applied easily, independently of the format and location of the data. Karma is workflow-oriented and needs a workflow orchestrator; therefore, it requires each discrete event (workflow step) to be defined and implemented as an SOA service. SOA-based architectures have been studied in detail in works such as [8, 9, 10]. In our study, we focus on numerical weather forecast models, particularly WRF, and make no modifications to the scientific source code. Our approach does not require a workflow orchestrator. We analyze log outputs and make inferences about the internal steps of the execution. In 2013, Jensen et al. proposed a provenance framework to be used in the processing of satellite data [11]. NOAA and NASA instrument data from satellites are beamed down to locations where they are gathered and then sent for processing. Jensen et al. used the Karma tool for backend provenance storage and retrieval in their framework and developed an adaptor to extract provenance-related activities from application log files. The Karma provenance system uses an extension of version 1.1 of the Open Provenance Model (OPM) [12] as its data model for external communication. Shu et al.
conducted a similar case study on the modeling and analysis of provenance data in hydrological models [13]. They present a provenance model for the representation of provenance information in streamflow forecasting. For this purpose, they extend the Open Provenance Model to satisfy the requirements of their case. Various other provenance-based systems utilizing the Karma tool also exist [21, 22, 23]. In our study, the provenance representation and data model are fully compatible with the W3C consortium's PROV specification, which defines a common provenance framework that is independent of a specific domain. To the best of our knowledge, however, the weather prediction and atmosphere modeling systems run on a global or regional scale by meteorological organizations, universities, or research institutions, whether for scientific research or operational forecasting, are not capable of producing, storing, and analyzing systematic provenance records. The Global Forecast System (GFS) is a non-open-source numerical weather prediction system that includes a global model run by the United States' National Weather Service (NWS). It is workflow-based and composed of multiple workflow components (data assimilation, forecast model, post-processing, etc.). Bernardet et al. proposed an infrastructure, the NWP Information Technology Environment (NITE), for scientists to configure, launch, and track experiments with various NWP models, including GFS. Its main goal is to record the provenance of the codes, scripts, configuration files, and inputs related to an experiment so that the experiment can be reviewed and reproduced [14]. ECMWF's Integrated Forecasting System (IFS) has its own workflow management system, ecFlow, in which each workflow must be defined as a suite of tasks.

In this study, we have designed a provenance/tracking system for the open-source WRF model, which is used by meteorological organizations in many countries across the world. In the first stage, the log files produced during the execution of the WRF model are analyzed and the lines containing provenance information are filtered. In the second stage, the corresponding provenance notifications are generated from the filtered lines and recorded in a provenance database in the background. In our earlier work [3], we proposed a rule-based log parser to extract provenance information from these log files. The parser utilized a rule database consisting of a list of special keywords to distinguish lines containing provenance information. In this study, we introduce a novel approach in which line filtering is achieved by machine-learning methods. Text classification by machine-learning algorithms has been used in countless areas such as search engines [15, 16], social media platforms [17], indexing, and sentiment analysis of texts [18, 19]. We utilize various text classification algorithms inside the machine learning-based log parser. In this way, our tracking and provenance analysis tool can filter the provenance-bearing lines of different WRF log files without the need for a rule base. The phases of a complete WRF run are illustrated in Fig. 1; more detailed information about each phase can be found in Section III of our previous paper [3].

The PROV specification [20] is a family of general-purpose documents recommended by the W3C consortium for modeling, representing, storing, and transferring provenance data in a standard way that is independent of the discipline.
While PROV-DM defines a basic data model for provenance data, PROV-N defines a provenance notation that people can understand. Besides, PROV-XML defines the framework of an XML schema so that provenance data can be stored and transferred in accordance with the PROV-DM data model, while PROV-O provides the necessary definitions to be able to create provenance ontologies by expressing the PROV data model with the help of OWL 2 Web Ontology Language. Since the PROV specification is intended to provide a common provenance framework that is independent of a specific discipline, there are only three basic concepts and basic relationships that can be established between those concepts: Activity, Entity, and Agent. According to the PROV specification, an entity can be anything physical, digital or conceptual, or a real or virtual thing. An activity is defined as anything that takes place in a given period of time and that carries out certain operations on entities. Operations such as processing, transforming, changing, using, or generating an entity are examples of activities. An agent is generally defined as anything that has certain responsibilities concerning entities or activities. An agent may be an entity or an activity. There are various approaches to obtaining provenance information. The first is the manual labeling approach. This method is not effective since it requires a high amount of labor and time. It is also error-prone because it is human-handled. The second approach is to modify the source code to make it produce provenance records automatically. However, the disadvantages of this method are the lack of access to the source code of the software at hand, the need to recompile the code after the changes to the source code, as well as the additional errors this may cause. A third method sometimes referred to as scavenging, is to examine sources such as various log files that are generated during the execution of programs and to extract provenance information from these sources. Even if it may lack a configurable debug level setting or enough information for a complete provenance, this approach is more applicable to most use cases than the other two methods. In our previous work [3] , we introduced an alternative approach that utilized both the scavenging method and the instrumentation of the shell script files. These are external shell scripts that are not part of the WRF software. They just invoke the required WRF components and insert the related provenance information into the log file. In that previous work, we proposed a rule-based provenance extraction method where the rules must be maintained manually by the programmer to adapt the parser to different log files. In this study, we investigate the use of supervised learning algorithms to model the provenance data and predict the type of provenance notifications. Here, machinelearning algorithms are employed during the analysis of log lines produced by the WRF model. In other words, lines containing provenance information are predicted by using text classification methods. Figure 3 shows the component diagram of the machine learning-based provenance collection methodology that we have developed. To illustrate the testing of the machine learning algorithms, the following supervised learning algorithms are used to classify lines containing provenance information: Logistic Regression, Naive Bayes, Random Forest, and Multilayer Perceptron. 
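As a convenience for turning a predicted class index back into a PROV statement, the sketch below pairs each PROV relationship with the concept types it connects, following the W3C PROV data model. The integer indices mirror the multi-class labels described in the next section; the exact index assignment shown here is an assumption made for illustration.

```java
// Sketch of a label table pairing multi-class indices with PROV relationships and the
// concept types they connect (per the W3C PROV data model). The index assignment
// (0 = irrelevant line, then 1..7) is an assumption made for illustration.
public enum ProvRelationLabel {
    IRRELEVANT(0, null, null),
    USED(1, "Activity", "Entity"),
    WAS_GENERATED_BY(2, "Entity", "Activity"),
    WAS_ASSOCIATED_WITH(3, "Activity", "Agent"),
    WAS_INFORMED_BY(4, "Activity", "Activity"),
    WAS_ATTRIBUTED_TO(5, "Entity", "Agent"),
    WAS_DERIVED_FROM(6, "Entity", "Entity"),
    ACTED_ON_BEHALF_OF(7, "Agent", "Agent");

    public final int classIndex;
    public final String subjectType; // PROV concept on the subject side of the relation
    public final String objectType;  // PROV concept on the object side of the relation

    ProvRelationLabel(int classIndex, String subjectType, String objectType) {
        this.classIndex = classIndex;
        this.subjectType = subjectType;
        this.objectType = objectType;
    }

    /** Maps a classifier's numeric prediction back to the corresponding relationship. */
    public static ProvRelationLabel fromIndex(int idx) {
        for (ProvRelationLabel label : values()) {
            if (label.classIndex == idx) return label;
        }
        throw new IllegalArgumentException("unknown class index: " + idx);
    }
}
```

Such a table adds no rules to the classification itself; it only maps the classifier's numeric output to the corresponding PROV relationship and the concept types of its subject and object.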
Data Pre-processing: Using N-gram frequency profiles, one can obtain a simple data representation for categorizing text files across a wide range of classification tasks. N-gram frequency profiles are a commonly used approach in text classification; an N-gram usually refers to an N-character slice of a longer string. To illustrate text classification for provenance data, in this study we simply use N-word frequency profiles and consider single-word (1-gram) strings for the data representation. The feature indexes in "feature_index:value" pairs are index numbers automatically assigned, starting from 1, to each distinct word in the log file in the order in which the words are encountered. The value part of a "feature_index:value" pair indicates the frequency of the word in the log line. In this study, the value part is set to 1 in all samples, since only term existence, rather than term frequency, is considered within the scope of the study.

Training Dataset: Classification algorithms require a training dataset to construct classification models. In this study, we created a labeled dataset for WRF log files by scanning each line, checking whether it contains any provenance data, and determining the type of provenance relationship. The training dataset is constructed by manually examining sample log files obtained from the WRF scientific program modules. In the pre-processed log files after N-gram conversion, the first value in each row represents the label of the class to which that row is assigned. The class labels start from zero, which indicates an irrelevant line, and increase by one up to the total number of different types of provenance relationships. We use the following provenance relationships as labels, according to the PROV-O specification: used, wasGeneratedBy, wasAssociatedWith, wasInformedBy, wasAttributedTo, wasDerivedFrom, actedOnBehalfOf.

Model construction is performed with a training set of log files before the system is deployed. The log file is given as input to the machine learning algorithms to train a classification model. Within the scope of the study, we constructed various machine-learning models by using the Logistic Regression, Naive Bayes, Random Forest, and Multilayer Perceptron algorithms. The classification of new WRF log lines is performed based on the constructed models; each model predicts one class label from the available multi-class labels. After the prediction phase, the Adaptor constructs a provenance notification with the appropriate provenance relationship and sends it to the provenance repository. We discuss the evaluation of the prediction tasks in Sect. 7. The proposed approach can be used in the same way to analyze the log files of Numerical Weather Prediction models other than WRF, without requiring additional software development.

To illustrate the testing of the proposed system, we developed a prototype. The machine learning-based provenance parser is designed as middle-layer software between the WRF software and the provenance repository software. For the repository, we used a PROV-O-compatible provenance storage technology, the Komadu service. In this study, the Turkish State Meteorological Service provided the computational facilities and the input atmospheric data for running the WRF model. We obtained log files from various runs of the WRF model and used them in testing the proposed provenance extraction methodologies.
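The term-existence encoding described above can be illustrated with a small sketch that assigns dictionary indices in first-seen order and fixes every value to 1. The sample log line and the label passed in main() are illustrative assumptions, not lines from an actual WRF run.

```java
// Sketch of the "feature_index:value" encoding described above, with values fixed to 1
// (term existence only). The sample line and label in main() are illustrative assumptions.
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

public class LibSvmLineEncoder {

    // Dictionary: each distinct word gets an index, starting from 1, in first-seen order.
    private final Map<String, Integer> featureIndex = new LinkedHashMap<>();

    /** Encodes one labeled log line as "label idx:1 idx:1 ..." with ascending indices. */
    public String encode(int label, String logLine) {
        TreeSet<Integer> present = new TreeSet<>(); // LIBSVM expects ascending indices per row
        for (String word : logLine.trim().toLowerCase().split("\\s+")) {
            present.add(featureIndex.computeIfAbsent(word, w -> featureIndex.size() + 1));
        }
        StringBuilder sb = new StringBuilder().append(label);
        for (int idx : present) {
            sb.append(' ').append(idx).append(":1"); // value is always 1: term existence only
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        LibSvmLineEncoder encoder = new LibSvmLineEncoder();
        // Label 1 is assumed here to stand for the "used" relationship.
        System.out.println(encoder.encode(1, "opening input file wrfinput_d01 for reading"));
    }
}
```

Running the example prints a row such as "1 1:1 2:1 3:1 4:1 5:1 6:1", i.e. the class label followed by the sparse term-existence features of the line.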
The WRF model is run with the highest possible debug_level to minimize missing provenance information. The W3C PROV data model (PROV-DM), serialized as PROV-XML, is used for modeling the provenance information obtained by the provenance extraction software. The machine-learning algorithms used in this study are implemented with MLlib, the machine learning library of Apache Spark, an open-source cluster-computing engine developed within the Apache Software Foundation. The classification algorithms in MLlib accept files in LIBSVM format as input. LIBSVM is a sparse feature-vector notation: each row of the file represents a feature vector composed of "feature_index:value" pairs separated by space characters. We investigate the performance of the prototype and discuss the results in the next section.

To evaluate the performance of the proposed system, various experiments are conducted. All of the system components are implemented in Java. A working prototype of the system is deployed on two virtual machines running on a computer with a Windows 10 operating system, an Intel Core i7 4720HQ CPU, and 8 GB of RAM. One of the virtual machines runs Ubuntu 12.04 as the guest operating system, on which the Komadu service runs stably; the other runs Ubuntu 16.04, on which WRF runs smoothly. Both virtual machines have 4 GB of RAM and two CPU cores. The Java version used is JDK 1.8.0.

The proposed system's classification performance is evaluated in terms of accuracy and precision/recall metrics. To test the parser, classification algorithms are chosen from different categories: Logistic Regression among regression-based classifiers, Naive Bayes among Bayesian classifiers, Random Forest as a representative of tree-based classifiers, and Multilayer Perceptron among neural network-based classifiers. Three different log files are used in the experiments for the evaluation of classification performance. One of these files is generated by the WRF model executed with debug_level 100, while the other two are generated with debug_level 150. The summary statistics of these log files are given in Table 1; 'Dictionary size' in the table refers to the total number of distinct words in the corresponding log file.

Raw log files must first be converted to LIBSVM format to be used as input to the classification algorithms. For this purpose, log lines are converted to feature vectors by using the Apache Spark machine-learning library's CountVectorizer and CountVectorizerModel classes. Then, static class labels are determined for multi-class classification. Afterward, the data in LIBSVM format are input to the Logistic Regression, Naive Bayes, Random Forest, and Multilayer Perceptron algorithms, and classification models are trained. To evaluate the performance of the classification process, the classification accuracies of the algorithms are examined. Each test procedure is repeated 100 times for each algorithm-log file combination, and the average and standard deviation of the classification accuracy are calculated. A 10-fold cross-validation approach is chosen as the evaluation methodology: each log file is randomly split into 10 subsets of approximately equal size, and each time a different subset is selected as the test set while the remaining subsets are used for training.
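The training step described above can be sketched with Spark MLlib's Java API as follows, using a Tokenizer and a binary CountVectorizer (values fixed to 1, matching the term-existence encoding) followed by Logistic Regression. The input file name, column names, and the single train/test split are assumptions made for illustration; the other three algorithms can be substituted for the final pipeline stage in the same way, and the actual evaluation uses the 10-fold cross-validation described above.

```java
// Hedged sketch: training a WRF log-line classifier with Spark MLlib's Java API.
// The CSV file name, column names, and split ratio are assumptions for illustration.
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;
import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WrfLogClassifierSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("wrf-log-provenance-classifier")
                .master("local[*]")
                .getOrCreate();

        // Assumed CSV layout: "label" (0 = irrelevant, 1..7 = PROV relation) and "line" (raw log text).
        Dataset<Row> labeled = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("labeled_wrf_log.csv");

        Tokenizer tokenizer = new Tokenizer().setInputCol("line").setOutputCol("words");
        // Binary counts: only term existence is recorded, matching the value-of-1 encoding above.
        CountVectorizer vectorizer = new CountVectorizer()
                .setInputCol("words").setOutputCol("features").setBinary(true);
        LogisticRegression lr = new LogisticRegression().setMaxIter(100);

        Pipeline pipeline = new Pipeline()
                .setStages(new PipelineStage[]{tokenizer, vectorizer, lr});

        // Simple hold-out split for the sketch; the paper's evaluation uses 10-fold CV instead.
        Dataset<Row>[] splits = labeled.randomSplit(new double[]{0.9, 0.1}, 42L);
        PipelineModel model = pipeline.fit(splits[0]);

        double accuracy = new MulticlassClassificationEvaluator()
                .setLabelCol("label").setPredictionCol("prediction").setMetricName("accuracy")
                .evaluate(model.transform(splits[1]));
        System.out.println("held-out accuracy = " + accuracy);
        spark.stop();
    }
}
```

Naive Bayes, Random Forest, or Multilayer Perceptron classifiers can replace the Logistic Regression stage without changing the rest of the pipeline.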
The overall performance metric for a specific log file is calculated by taking the average of the results obtained for each of the 10 subsets. We evaluated the multi-class classification performance of all the algorithms in terms of accuracy and precision, and the results can be seen in Fig. 4a and Fig. 4b, respectively. All algorithms achieved very high classification accuracies and performed with high precision and high recall. We therefore argue that successful provenance extraction can be conducted without the need for a ruleset. As a last note, the size and contents of the log files produced by the WRF model may vary depending on parameters such as the size of the region to be predicted or the length of the prediction period. However, when different log files are examined, it can be seen that they generally share a common pattern and a high degree of similarity. For this reason, the machine-learning models achieve very high performance on the various log files obtained by running the WRF model with different initial parameters, such as time periods or prediction regions.

In this study, we investigate a machine learning-based approach to provenance extraction from the log files of scientific applications. In this approach, supervised learning algorithms are used to model the provenance data and predict the type of provenance relationships. Machine-learning algorithms are employed during the analysis of the log lines produced by the WRF model, and multi-class classification is used to obtain the different provenance relationships from the log lines. The results indicate that successful provenance extraction can be conducted by utilizing machine-learning algorithms without the need for a ruleset. Hence, using machine-learning algorithms to parse logs for provenance can eliminate the need for a rule database. To facilitate testing of the system, we developed a prototype implementation and made it available as open-source software in a GitHub repository. The system is implemented with Apache Spark's MLlib library, from which the Logistic Regression, Naive Bayes, Random Forest, and Multilayer Perceptron algorithms are applied for multi-class classification. The trained models are run on the sample log files, and it is observed that they perform well even on log files containing a large number of lines.

References

[1] A survey of data provenance in e-science
[2] The W3C PROV family of specifications for modelling provenance metadata
[3] Provenance collection platform for the Weather Research and Forecasting Model
[4] A framework for collecting provenance in data-centric scientific workflows
[5] Karma. Pervasive Technology Institute website
[6] Komadu: Provenance collection and visualization tool based on W3C PROV standard
[7] Linked environments for atmospheric discovery (LEAD): architecture, technology roadmap and deployment strategy
[8] XML metadata services
[9] High-performance hybrid information service architecture
[10] Information services for dynamically assembled semantic grids
[11] Provenance capture and use in a satellite data processing pipeline
[12] The open provenance model core specification (v1.1)
[13] Modelling provenance in hydrologic science: a case study on streamflow forecasting
[14] The design of a modern information technology infrastructure to facilitate research-to-operations transition for NCEP's modeling suites
[15] Building domain-specific search engines with machine learning techniques
[16] A machine learning architecture for optimizing web search engines
[17] Finding high-quality content in social media
[18] Sentiment analysis in Twitter using machine learning techniques
[19] Thumbs up?: Sentiment classification using machine learning techniques
[20] PROV-Overview: An Overview of the PROV Family of Documents
[21] Detecting misinformation in social networks using provenance data
[22] An approach to custom privacy policy violation detection problems using big social provenance data
[23] Application of provenance in social computing: a case study