key: cord-0044199-3p13ugqh
authors: Silva, António João; Cortez, Paulo; Pilastri, André
title: Chemical Laboratories 4.0: A Two-Stage Machine Learning System for Predicting the Arrival of Samples
date: 2020-05-06
journal: Artificial Intelligence Applications and Innovations
DOI: 10.1007/978-3-030-49186-4_20
sha: 3f816a1cbb4ad735a54792bf22fb55fe51b849c2
doc_id: 44199
cord_uid: 3p13ugqh

This paper presents a two-stage Machine Learning (ML) model to predict the arrival time of In-Process Control (IPC) samples at the quality testing laboratories of a chemical company. The model was developed using three iterations of the CRoss-Industry Standard Process for Data Mining (CRISP-DM) methodology, each focusing on a different regression approach. To reduce the ML analyst effort, an Automated Machine Learning (AutoML) procedure was adopted during the modeling stage of CRISP-DM. The AutoML was set to select the best among six distinct state-of-the-art regression algorithms. Using recent real-world data, the three main regression approaches were compared, showing that the proposed two-stage ML model is competitive and provides interesting predictions to support laboratory management decisions (e.g., preparation of testing instruments). In particular, the proposed method can accurately predict 70% of the examples under a tolerance of 4 time units.

The Industry 4.0 concept assumes a high usage of Artificial Intelligence (AI), where industrial physical processes generate data that can be analyzed by Business Analytics, namely Data Mining (DM) and Machine Learning (ML) techniques, aiming to improve factory efficiency (e.g., reduce costs, enhance production levels) [21]. This concept is transforming the chemical industry, which has a large impact on the world economy (e.g., petrochemicals, pharmaceuticals). In this work, we address a relevant Business Analytics need of a chemical company that is adopting an Industry 4.0 transformation.
To ensure the quality of the products being manufactured, samples taken from the company production processes need to be tested in laboratories. The tests assure that the products are compliant with quality standards, allowing their usage by the company clients. In this context, predicting the arrival of production samples at the laboratory is a key issue, since it helps in the allocation of equipment and human resources. Aiming to solve this task, this paper presents a novel two-stage ML prediction system, which was developed during the implementation of a CRoss-Industry Standard Process for DM (CRISP-DM) [25] project that included three iterations, each focusing on a distinct regression strategy. During the modeling stage of the three CRISP-DM iterations, an Automated ML (AutoML) [12] procedure was adopted, which allowed us to compare and configure six state-of-the-art ML algorithms. The paper is structured as follows. Section 2 describes the related work. The business task, data and proposed approach are presented in Sect. 3. The obtained results are shown in Sect. 4. Finally, Sect. 5 concludes the paper.

In recent years, there has been an increased interest in the field of AI, due to the rise of data, computational power and sophisticated learning algorithms (e.g., Deep Learning) [9]. Following the Industry 4.0 revolution [21], many factories are now generating data that can be analyzed by DM and ML techniques in order to support managerial decision-making. Yet, several real-world DM projects tend to fail due to a misalignment between business needs and ML analyses [10]. CRISP-DM is an open standard and robust methodology that was specifically developed to reduce this misalignment and increase the success of DM projects [25]. CRISP-DM includes six stages that are executed through several iterations and that involve both business and ML experts: business understanding, data understanding, data preparation, modeling, evaluation and deployment.
CRISP-DM is a popular methodology. For instance, it has been applied to the Banking [18] and Health Care [3] domains. Regarding the analyzed chemical industry, the quality testing laboratories are mostly managed manually, with the usage of Information Technology (IT) being more focused on storing the test values rather than on the process [16, 22]. Moreover, the data is typically spread across different databases that work as information silos (e.g., production, laboratory testing), so it is difficult to access all data under a single version of the truth. By adopting the Industry 4.0 concept, which assumes a better usage of IT, there is a potential gain in optimizing the management of the chemical laboratories. In this work, we describe one aspect of the Industry 4.0 transformation that is being conducted by a chemical company. It corresponds to the result of implementing a CRISP-DM project that uses both production and laboratory testing databases.

In terms of Predictive Analytics applied to industry, most studies target predictive maintenance via several ML algorithms, such as Random Forest (RF) [4], Neural Networks (NN) [23] and Gradient Boosting Machines (GBM) [17]. There are also studies about non-maintenance prediction applications, such as: the classification of the quality of products produced by injection molding processes via Boosting, RF and NN models [5]; and the estimation of the endpoint temperature and chemical concentration of a furnace when producing low-carbon steel using RF and ridge regression algorithms [19]. All these studies require the selection and configuration of the right ML algorithm, which often depends on ML expert knowledge and involves heuristics or trial-and-error experiments [14]. In order to avoid this time-consuming procedure (in terms of the ML expert effort), we adopt an AutoML [12] procedure during the modeling stage of CRISP-DM.
This systematization and automation of the ML model selection provides two main advantages. First, it alleviates the effort of the ML analyst, who can then focus on other ML aspects in order to provide better business value. In particular, in this paper, it allowed us to implement more iterations of the CRISP-DM methodology, which was helpful to design the proposed two-stage ML model. Second, it reduces the ML maintenance effort, since the ML model can be retrained automatically as new data arrives, which is advantageous for the analyzed company.

The analyzed chemical company produces several products, in batches. During the production-batch execution process, a sequence of samples, called In-Process Control (IPC) samples, is selected for quality laboratory inspection, in order to ensure that the production process is running as expected. In terms of the chemical laboratories, the IPC samples have the highest priority, because the production process cannot continue without their approval. A fixed number of IPC samples is selected from each production-batch ($s \in \{1, \ldots, IPC_{max}\}$). The production information system registers several attributes related to the IPC sample production, including its initial production time, denoted here as the IPC production time $PT_s$. One by one, the IPC samples arrive at the laboratory at time $LT_s$, at irregular intervals that are difficult to estimate in advance. The business goal is thus the non-trivial task of predicting the arrival time of each IPC sample at the chemical laboratories. Solving this task efficiently allows a better management of the laboratory equipment and human resources. For instance, some IPC quality tests require a setup time, in which the analysts need to prepare the laboratory testing instruments in advance. The business goal was addressed as a regression task, under two main target goals.
In the first CRISP-DM iteration, we only used laboratory temporal data and the target goal was defined as predicting $y_1 = LT_{s+1} - LT_s$, which corresponds to the time lag between the next IPC sample arrival ($LT_{s+1}$) and the current (already known) laboratory sample arrival ($LT_s$). In the second and third CRISP-DM iterations, we explored production temporal data, predicting $y_2 = LT_s - PT_s$, where the laboratory arrival time can be estimated as soon as the IPC sample starts its production.

We used an Extract, Transform, Load (ETL) procedure to merge the relevant data from two main databases related to the production and laboratory testing information systems, populating an integrated and business-oriented data warehouse system. The ETL resulted in a raw file with 226,929 rows and 33 columns regarding all laboratory samples that were analyzed during a three-year time period. The data warehouse was further filtered to contain only rows related to IPC samples and with complete values in terms of the input and output attributes (Table 1), leading to a dataset with 26,611 instances. The input variables were manually selected and defined from the filtered raw file using expert domain knowledge, obtained by interacting with the chemistry experts. Due to the complexity of the chemical factory processes and information system integration issues, it was not possible to access a richer set of data features (e.g., which components and machines were used to produce the samples). Thus, the resulting set of 8 inputs is rather small, which makes the prediction task more challenging. Both output targets were computed using a particular time unit, which is not disclosed here due to business privacy issues.

In terms of computational environment, we adopted the R tool and its rminer package [8] for data manipulation and ML result evaluation, while the AutoML adopts the H2O implementation [7].
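To make the two target definitions above concrete, the following minimal Python sketch computes $y_1$ and $y_2$ for one production-batch. It is only an illustration and not the authors' R code: the function names, the toy timestamps and the choice of hours as the time unit are assumptions (the actual time unit is not disclosed in the paper).

```python
from datetime import datetime

def time_lag(later, earlier, unit_seconds=3600.0):
    """Return (later - earlier) in time units; hours are an assumed unit."""
    return (later - earlier).total_seconds() / unit_seconds

def build_targets(batch):
    """Compute both regression targets for one production-batch.

    batch: list of (PT_s, LT_s) datetime pairs ordered by s = 1..IPC_max.
    Returns (y1, y2), where
      y1[i] = LT_{s+1} - LT_s  (defined only while a next sample exists)
      y2[i] = LT_s - PT_s      (defined for every sample)
    """
    y2 = [time_lag(lt, pt) for pt, lt in batch]
    y1 = [time_lag(batch[i + 1][1], batch[i][1]) for i in range(len(batch) - 1)]
    return y1, y2

# Hypothetical batch with three IPC samples: (production start, lab arrival).
batch = [
    (datetime(2020, 1, 6, 8, 0), datetime(2020, 1, 6, 10, 0)),
    (datetime(2020, 1, 6, 11, 0), datetime(2020, 1, 6, 12, 30)),
    (datetime(2020, 1, 6, 14, 0), datetime(2020, 1, 6, 17, 0)),
]
y1, y2 = build_targets(batch)
print("y1:", y1)
print("y2:", y2)
```

Note that $y_1$ has one value fewer than $y_2$ per batch, since it requires a previously arrived sample, whereas $y_2$ is available for every sample, including the first one.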
The AutoML procedure was configured to select the regression model and its hyperparameters based on the best Root Mean Squared Error (RMSE), computed using a validation set that is obtained by applying an internal 10-fold cross-validation method over the training data. All computational experiments were executed on the same personal computer and each individual ML model was trained up to a maximum running time of 3,600 s. Once an ML model was selected, it was retrained with all training data.

As in [11], the AutoML was configured to include a total of 6 distinct regression algorithms: RF, Extremely Randomized Trees (XRT), Generalized Linear Model (GLM), GBM, XGBoost (XG) and a Stacked Ensemble (SE). The RF is a popular ensemble method that combines a large number of decision trees based on bagging and random selection of input features [15]. The XRT algorithm extends the RF approach by randomly selecting the decision thresholds of the tree nodes [13]. GLM estimates regression models for exponential family distributions (e.g., Gaussian, Poisson, gamma) [15]. The GBM algorithm is based on a generalization of tree boosting, sequentially building regression trees for all data features [15]. XG is another ensemble tree method that uses boosting to enhance the prediction results [6]. The SE method, also known as stacked regression [2], combines the predictions of different base learners by using a second-level ML algorithm.

The H2O implementation [7] uses the following AutoML setup: RF and XRT: set with the default hyperparameters; GLM: grid search used to set one hyperparameter (alpha, a regularization parameter); GBM and XG: grid search used to tune nine and ten hyperparameters, respectively (e.g., number of trees, maximum depth, minimum rows); SE: all five algorithms (RF, XRT, GLM, GBM, XG) are used as base learners and the individual predictions are weighted by using a second-level GLM learner.
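The selection logic described above (rank candidate models by cross-validated RMSE, then retrain the winner on all training data) can be sketched in plain Python. This is a toy stand-in and not the H2O AutoML implementation: the two candidate learners, the interleaved fold assignment and all function names are assumptions made only for illustration.

```python
import math

def fit_mean(xs, ys):
    """Baseline learner: always predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Single-input least squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def cv_rmse(fit, xs, ys, k=10):
    """RMSE pooled over the validation predictions of k folds."""
    n = len(xs)
    truths, preds = [], []
    for fold in range(k):
        val_idx = set(range(fold, n, k))  # interleaved folds (an assumption)
        train_x = [xs[i] for i in range(n) if i not in val_idx]
        train_y = [ys[i] for i in range(n) if i not in val_idx]
        model = fit(train_x, train_y)
        for i in sorted(val_idx):
            truths.append(ys[i])
            preds.append(model(xs[i]))
    return rmse(truths, preds)

def select_and_refit(candidates, xs, ys, k=10):
    """Pick the candidate with the lowest CV RMSE and retrain it on all data."""
    scores = {name: cv_rmse(fit, xs, ys, k) for name, fit in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores, candidates[best](xs, ys)

# Synthetic data with an exact linear relation.
xs = list(range(40))
ys = [3.0 + 0.5 * x for x in xs]
best, scores, model = select_and_refit({"mean": fit_mean, "line": fit_line}, xs, ys)
print(best, scores)
```

In the paper, the candidates are the six H2O algorithms and the search is additionally bounded by a running time limit per model, which this sketch does not reproduce.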
For the ML algorithms that require numeric inputs (e.g., GLM), the nominal inputs (e.g., product, grade) are previously transformed by using the standard one-hot encoding, which assigns one boolean input per categorical level. For instance, a categorical feature with three levels ({a, b, c}) is encoded as: a = (1, 0, 0), b = (0, 1, 0) and c = (0, 0, 1).

A total of three CRISP-DM iterations were executed, aiming to improve the regression results and the potential value of the ML models. The first CRISP-DM iteration targeted the $y_1$ output, while the second and third CRISP-DM iterations approached $y_2$, under two variants. The $y_1$ target assumes that at least one IPC sample from the production-batch has arrived at the laboratory. The trained ML model can be used each time a new sample arrives, allowing us to estimate when the next sample will be delivered ($y_1$). A different perspective is adopted by the $y_2$ target, since the fitted ML model can be applied to predict the laboratory sample arrival once an IPC sample production has started. The model employed in the second CRISP-DM iteration uses a simple regression with a single ML model ($y_2$). During the evaluation stage of the second CRISP-DM iteration, we identified some high prediction errors, in particular when predicting the arrival times for the first sample of the production-batch ($s = 1$). In order to check if we could improve these results, a third CRISP-DM iteration was executed, in which we specialized two distinct ML models (α and β). The first ML model (α) is trained using only the examples of the first production-batch sample ($s = 1$) and thus the fitted model includes only seven input attributes ({day, month, product, version, grade, stage, batch}). The second model (β) is only activated when producing the other production-batch IPC samples ($s > 1$). Similarly to the second CRISP-DM iteration model, this ML model is trained with all eight inputs (including $s$, the sample sequence number).
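The routing between the two specialized models described above can be sketched as follows. This is a minimal Python illustration, assuming dictionary-based examples and toy callables in place of the fitted α and β models; the class name, helper function and example values are hypothetical, while the input attribute names follow the text.

```python
ALPHA_INPUTS = ("day", "month", "product", "version", "grade", "stage", "batch")
BETA_INPUTS = ALPHA_INPUTS + ("s",)  # beta also receives the sequence number s

def split_by_stage(examples):
    """Split training examples into first samples (s = 1) and the rest (s > 1)."""
    first = [e for e in examples if e["s"] == 1]
    others = [e for e in examples if e["s"] > 1]
    return first, others

class TwoStageModel:
    """Route each example to the alpha (s = 1) or beta (s > 1) model."""

    def __init__(self, alpha, beta):
        self.alpha = alpha  # callable fitted on s = 1 examples, seven inputs
        self.beta = beta    # callable fitted on s > 1 examples, eight inputs

    def predict(self, example):
        if example["s"] == 1:
            return self.alpha({k: example[k] for k in ALPHA_INPUTS})
        return self.beta({k: example[k] for k in BETA_INPUTS})

# Toy stand-ins for the fitted models (hypothetical, not learned from data).
def toy_alpha(features):
    if "s" in features:
        raise ValueError("alpha must not receive the sequence number")
    return 12.0

def toy_beta(features):
    return 3.0 + 0.5 * features["s"]

model = TwoStageModel(toy_alpha, toy_beta)
first = dict(day=1, month=5, product="P1", version="v2", grade="A",
             stage=1, batch=17, s=1)
later = dict(first, s=4)
print(model.predict(first), model.predict(later))
```

Keeping the routing rule outside the learners means that each of the two models can be selected and tuned independently, for instance by a separate AutoML run per subset.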
The proposed two-stage model ($y_{2\alpha\beta}$) is shown in Fig. 1.

The collected data was divided into three main sets, by using a chronological order. The last 20 weeks of data (a total of 5,110 examples) were kept out of the initial ML experiments. The goal was to apply this additional unseen data in a more realistic evaluation, provided by a Rolling Window (RW) validation [24] that is executed for the best ML regression approach. The remaining and oldest 21,501 examples (not used as test set by the RW) were further divided into training and test sets (holdout split) [20]. The time ordered Holdout Split (HS) was used to compare the three distinct main regression approaches (from the CRISP-DM iterations). The training data included the oldest 15,050 examples (around 70%). As for the HS test set, it included 6,451 instances. Regarding the RW, it was set using a fixed training window with six months of data and a weekly testing of the ML models, in a total of 20 iterations. In the first iteration, at the first Sunday, the ML model was trained with the last six months of historical data. Then, the model was used to perform sample arrival predictions for the incoming week (fixed test size of seven days). In the second iteration, executed at the second Sunday, the training window was updated by discarding one week of the oldest data and adding the examples of the previous week, allowing us to update (retrain) the ML model, which then predicted the sample arrival times of the next week, and so on.

In this work, we adopt two popular regression error measures: RMSE and Mean Absolute Error (MAE). We also use the Acc@T metric, which is more easily understood by the business analysts, since it measures the percentage of examples accurately predicted when assuming an absolute error tolerance of $T$. A quality regression model should provide low RMSE and MAE values and also a high accuracy for a small $T$ value.
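The rolling window scheme and the three error measures described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the rminer code used by the authors: the six-month training window is approximated by 26 weeks, the tolerance test uses an inclusive comparison (absolute error <= T), and all names and toy values are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def acc_at_t(y_true, y_pred, tol):
    """Acc@T: percentage of examples whose absolute error is within tol."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tol)
    return 100.0 * hits / len(y_true)

def rolling_windows(n_weeks, train_weeks, test_weeks=1):
    """Return (train_week_ids, test_week_ids) pairs for a fixed-size window
    that slides forward by test_weeks at each iteration."""
    windows = []
    start = 0
    while start + train_weeks + test_weeks <= n_weeks:
        split = start + train_weeks
        windows.append((list(range(start, split)),
                        list(range(split, split + test_weeks))))
        start += test_weeks
    return windows

# Toy targets and predictions (absolute errors: 1, 0, 5, 20).
y_true = [10.0, 12.0, 20.0, 30.0]
y_pred = [11.0, 12.0, 25.0, 10.0]
print(rmse(y_true, y_pred), mae(y_true, y_pred), acc_at_t(y_true, y_pred, 4))

# 26 training weeks (about six months, assumed) plus 20 weekly test sets.
windows = rolling_windows(n_weeks=46, train_weeks=26)
print(len(windows), windows[0][1], windows[-1][1])
```

In each iteration, the model would be retrained on the examples of the training weeks and evaluated on the examples of the single test week, with the measures then aggregated over all 20 iterations.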
The Acc@T concept allows comparing the predictive performance of different regression models in a single graph, as proposed in [1] with the Regression Error Characteristic (REC) curves, which plot on the y-axis the Acc@T for different $T$ values (x-axis). The overall quality (for distinct $T$ values) can be measured by computing the Area of the REC (AREC) curve (in %) when assuming a maximum tolerance of $T_{max}$.

Table 2 presents the test data errors, in terms of the RMSE error measure, for the HS evaluation and when comparing the two $y_2$ prediction strategies: $y_2$, executed during the second CRISP-DM iteration; and $y_{2\alpha\beta}$, explored in the third CRISP-DM iteration. The RMSE values confirm that, for both prediction strategies, it is more difficult to predict the arrival of the first IPC sample ($s = 1$) than the arrival of the remaining samples ($s > 1$). It is interesting to notice that by specializing a learning model for each of these IPC sample types, as executed in the third CRISP-DM iteration ($y_{2\alpha\beta}$), a substantial error reduction is achieved for both sample types ($s = 1$ and $s > 1$). The full comparison of the aggregated HS results, assuming all IPC samples, is shown in Table 3, which contains: the evaluation method used (Eval.); the best model selected using the AutoML procedure (Model); and several predictive performance measures. The AREC was computed by using a maximum tolerance of $T_{max} = 16$ time units. All performance measures confirm that the best predictive model was achieved by $y_{2\alpha\beta}$, while $y_1$ obtained better results than $y_2$. When compared with $y_1$, $y_{2\alpha\beta}$ achieved a substantial predictive improvement: RMSE, reduction of 46.8 points; MAE, difference of 14.1 points; and AREC, increase of 10 percentage points. As for the ML algorithms, the AutoML selected GBM and SE as the best performing models when using the 10-fold internal cross-validation (applied over the training data). The $y_{2\alpha\beta}$ approach uses GBM for predicting the arrival times of the $s = 1$ samples and SE for the other ones.
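As an illustration of the AREC measure used in Table 3, the following Python sketch builds a REC curve and computes a normalized area up to $T_{max}$. The numerical details are assumptions, since the paper does not specify them: the curve is sampled on an evenly spaced tolerance grid, integrated with the trapezoidal rule and divided by $T_{max}$ so that a perfect model scores 100%. The function names and toy data are hypothetical.

```python
def rec_curve(y_true, y_pred, tolerances):
    """Acc@T (in %) for each tolerance T, i.e. the points of a REC curve."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    n = len(errors)
    return [100.0 * sum(e <= tol for e in errors) / n for tol in tolerances]

def arec(y_true, y_pred, t_max=16.0, steps=160):
    """Normalized area under the REC curve on [0, t_max], in %.

    Trapezoidal integration on an evenly spaced grid (an assumed scheme)."""
    tolerances = [t_max * i / steps for i in range(steps + 1)]
    accs = rec_curve(y_true, y_pred, tolerances)
    width = t_max / steps
    area = sum((accs[i] + accs[i + 1]) / 2.0 * width for i in range(steps))
    return area / t_max

# Toy targets and predictions (absolute errors: 1, 0, 5, 20).
y_true = [10.0, 12.0, 20.0, 30.0]
y_pred = [11.0, 12.0, 25.0, 10.0]
points = rec_curve(y_true, y_pred, [1, 2, 4, 16])
score = arec(y_true, y_pred)
print(points, score)
```

A model whose errors all exceed $T_{max}$ obtains an AREC of 0%, so the measure summarizes in one number how quickly the REC curve rises within the tolerance range of interest.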
Figure 2 complements the HS results by showing the respective REC curves for the three main regression approaches. The plot confirms that for most of the low tolerance range (x-axis), $y_{2\alpha\beta}$ provides a higher accuracy (Acc@T), resulting in an overall higher AREC. Indeed, the proposed two-stage ML model can correctly predict 37%, 59% and 70% of the samples for low tolerance values of $T = 1$, $T = 2$ and $T = 4$, respectively, a value that increases to 85% when the tolerance is increased to $T = 16$ time units.

To estimate how the selected model ($y_{2\alpha\beta}$) would behave in a real environment setting, we tested it under a RW evaluation. The results for all 20 week iterations are shown in the last row of Table 3 and are consistent with the HS evaluation. In effect, the same AREC value is achieved (71%), while the RMSE and MAE values are slightly lower (RMSE of 37.5 and MAE of 11.4). This is an interesting result, since the RW evaluation used more recent test data, not seen when comparing the HS results.

The obtained results were presented to the business domain experts, who considered them very positive, encouraging the incorporation of the two-stage prediction model into a friendly dashboard that included several business indicators to support the laboratory management decisions. To facilitate the visualization, the dashboard was designed to provide different granularity levels (hourly, daily or monthly) for the sample arrival prediction. For demonstrative purposes, Fig. 3 plots the real and predicted values when assuming a daily aggregation of the IPC sample arrival for a particular chemical laboratory and for the entire RW testing time period. Due to business privacy issues, the scale of the y-axis is omitted from the graph. Figure 3 shows that the predictions are very close to the real values, denoting a high quality fit of the prediction model.
This paper addresses the non-trivial task of predicting the arrival of In-Process Control (IPC) samples at chemical laboratories for quality testing. To solve this task, we implemented the CRoss-Industry Standard Process for Data Mining (CRISP-DM) methodology, under three iterations, each focusing on a different regression approach. During the data understanding and preparation CRISP-DM stages, we collected recent data from a chemical company, resulting in 26,611 sample arrival examples related to a three-year time period. As for the modeling stage of CRISP-DM, we employed an Automated Machine Learning (AutoML) procedure to automatically select and configure the best model when exploring six state-of-the-art ML algorithms.

Several experiments were held. Using a time ordered Holdout Split (HS), we compared the three main regression approaches: $y_1$, which predicts the time lag between the arrival of two consecutive samples, executed in the first CRISP-DM iteration; $y_2$, which predicts the time lag between starting the production of the sample and its arrival at the laboratory, explored in the second CRISP-DM iteration; and $y_{2\alpha\beta}$, a two-stage ML model to predict $y_2$, developed in the third CRISP-DM iteration. For all predictive performance measures, the best results were achieved by the two-stage ML model, which obtained interesting results (e.g., it can accurately predict 70% of the examples under a tolerance of $T = 4$ time units). The selected two-stage ML model ($y_{2\alpha\beta}$) was further evaluated using a realistic Rolling Window (RW) procedure, which considered 20 weeks of unseen data. A similar predictive performance was achieved, when compared with the HS results, showing that the proposed two-stage ML model is robust for the analyzed chemical company. In effect, the ML model was incorporated into a friendly dashboard prototype, obtaining valuable feedback from the chemical laboratory managers.
In future work, we intend to apply the two-stage model to predict the arrival of other types of samples (e.g., raw material). Moreover, we intend to further explore the deployment stage of CRISP-DM, by better integrating the proposed model in a decision support system tool, for instance, by using the predictions to directly optimize the laboratory human resources and instruments.

[1] Regression error characteristic curves
[2] Stacked regressions
[3] Using data mining for prediction of hospital length of stay: an application of the CRISP-DM methodology
[4] Real-time predictive maintenance for wind turbines using big data frameworks
[5] Integration of artificial intelligence in an injection molding process for on-line process parameter adjustment
[6] XGBoost: a scalable tree boosting system
[7] Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI
[8] Modern Optimization with R
[9] Human-level intelligence or animal-like abilities?
[10] The ten most common data mining business mistakes
[11] An automated and distributed machine learning framework for telecommunications risk management
[12] Efficient and robust automated machine learning
[13] Extremely randomized trees
[14] Which method to use? An assessment of data mining methods in environmental data science
[15] The Elements of Statistical Learning: Data Mining, Inference, and Prediction
[16] The future of the laboratory information system - what are the requirements for a powerful system for a laboratory data management?
[17] Machine learning application in predictive maintenance
[18] Using data mining for bank direct marketing: an application of the CRISP-DM methodology
[19] Multivariate time series for data-driven endpoint prediction in the basic oxygen furnace
[20] On the use of holdout samples for model selection
[21] Smart factories in industry 4.0: a review of the concept and of energy management approached in production based on the internet of things paradigm
[22] Laboratory information management systems in the work of the analytic laboratory
[23] Concept of predictive maintenance of production systems in accordance with industry 4.0
[24] Out-of-sample tests of forecasting accuracy: an analysis and review
[25] CRISP-DM: towards a standard process model for data mining

Acknowledgments. This work has been supported by FCT (Fundação para a Ciência e Tecnologia) within the R&D Units Project Scope: UIDB/00319/2020. The authors also wish to thank the chemical company staff involved with this project for providing the data and also the valuable domain feedback.