Shiny App to Predict Agricultural Tire Dimensions
Ana Rita Antunes, Ana Cristina Braga
Computational Science and Its Applications - ICCSA 2020, 2020-08-20. DOI: 10.1007/978-3-030-58808-3_19

Abstract. The main objective of this project, carried out in an industrial context, was to apply multivariate analysis to the variables related to the specifications required for the production of an agricultural tire and to the dimensional test results. The exploratory data analysis identified strong correlations between the predictor variables and with the response variable of each test. In this project, principal component analysis (PCA) was used to eliminate the effects of multicollinearity, and regression analysis was used to predict the behavior of the agricultural tire from the selected variables of each test. In the case of Test 1, when applying stepwise methods to select the variables, the model with the lowest Akaike Information Criterion (AIC) was obtained with the "Both" technique, whereas the lowest AIC for Test 2 was obtained with "Backward". The assumptions of both models, for Test 1 and for Test 2, were validated. All the quantitative variables are important, in Test 1 and in Test 2, because they enter the linear combinations that define the principal components. To make it easier to compute predictions for future agricultural tires, an application developed in Shiny allows the company to know the behavior of a tire before it is produced. Using the application, it is possible to reduce industrialization time, materials and resources, thus increasing efficiency and profits.

In the industrial process of producing a new tire, several specifications need to be considered. An agricultural tire consists of different components, such as the tread, belt, inner liner, sidewall and bead, among others. In this context, it is important to define the mold, the materials and their quantities. After that, the tire has to pass several tests, for example dimensional and endurance tests. The tire passes a test if the results comply with the legal norms, which define the maximum and minimum dimensional and endurance values. When a test result is greater than the defined maximum, the tire does not pass and the company has to change the type and/or the quantity of materials.

In this study, the main goal was to apply multivariate analysis to variables related to the specifications required for agricultural tire production and to the dimensional test results, Test 1 and Test 2. The purpose was to understand which variables influence the test results and to predict their values. Developing a tire requires considering many variables simultaneously, so being able to predict the values of the two tests makes the producers' work easier. Multiple Linear Regression (MLR) helps achieve these results, because the predictor variables are quantitative. MLR has several assumptions to be verified, and one issue to address is multicollinearity. Multicollinearity occurs when two or more predictor variables are strongly correlated among themselves, and this can cause problems in MLR.
When regression coefficients are estimated from highly correlated predictor variables, the coefficients tend to vary widely. Another problem is that, when interpreting a regression coefficient, its sign can be misleading [5]. One way to correct this problem is to use Principal Component Regression (PCR), that is, a linear regression on principal components. Maxwell et al. [7] addressed multicollinearity effects by testing five methodologies: Partial Least Squares Regression (PLSR), Ridge Regression (RR), Ordinary Least Squares (OLS) regression, Least Absolute Shrinkage and Selection Operator (LASSO) regression, and Principal Component Regression (PCR). To compare the five methodologies, they varied the number of observations and the number of predictor variables, and used the Root Mean Square Error (RMSE) and the AIC to compare the performance of each model. They concluded that PCR has the lowest RMSE and AIC, which means that, according to them, PCR is the most efficient method for handling critical multicollinearity effects [7]. Lafi and Kaneene used Principal Component Analysis (PCA) to detect and correct multicollinearity effects in a veterinary epidemiological study, comparing OLS and PCR for adjusting regression coefficients; the PCR coefficients were more reliable than the OLS ones [6]. After selecting the models for Test 1 and Test 2, a web application was developed to predict the test results before a tire is produced.

PCA is a statistical procedure for multivariate problems. It was introduced in 1901 by Pearson and independently developed in 1933 by Hotelling [8]. PCA is useful when there are many predictor variables relative to the number of observations in the dataset. It is also used when the predictor variables are highly correlated with each other, because it eliminates the effects of multicollinearity. PCA is normally used to reduce the dimensionality of a problem, with the first principal components representing most of the information contained in the dataset: the first PC explains the greatest proportion of the variation of the original variables, the second explains the second greatest proportion while being independent of the first, and so on. As is widely known, PCA transforms a set of highly correlated variables into a new set of uncorrelated, orthogonal variables. Each new variable is a linear combination of the initial p variables, described as follows (Eq. 1):

$PC_i = a_{i1}x_1 + a_{i2}x_2 + \dots + a_{ip}x_p, \quad i = 1, \dots, p, \qquad (1)$

where $a_{ij}$ are the loadings, $x_1, x_2, \dots, x_p$ are the initial variables and $PC_1, PC_2, \dots, PC_p$ are the p PCs [4]. After obtaining the linear combination for each component, replacing the initial variables with their observed values yields the scores [5].

With linear regression it is possible to study the linear relationship between a response variable ($y_i$, i = 1, ..., n) and one or more predictor variables ($x_{ij}$, j = 1, ..., p), where the response variable is quantitative and the predictor variables can be quantitative or qualitative. When there is more than one predictor variable, the model is called Multiple Linear Regression (MLR) (Eq. 2):

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i, \quad i = 1, \dots, n, \qquad (2)$

where $\beta_0$ is the constant term and $\beta_1, \dots, \beta_p$ are the coefficients of each variable.
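To make the principal component regression idea concrete, the following is a minimal R sketch rather than the code used in the study; the data frame `tires`, the column names `x1` to `x27` and the response `y1` are hypothetical names.

```r
# Minimal principal component regression (PCR) sketch -- 'tires', 'x1'..'x27' and 'y1'
# are hypothetical names, not the coded variables of the original dataset.
predictors <- tires[, paste0("x", 1:27)]

# Standardize the predictors (they are measured on different scales) and extract the PCs
pca    <- prcomp(predictors, center = TRUE, scale. = TRUE)
scores <- as.data.frame(pca$x)      # uncorrelated PC scores, one column per component

# Regress the test result on the orthogonal principal components
scores$y1 <- tires$y1
pcr_model <- lm(y1 ~ ., data = scores)
summary(pcr_model)
```

Because the PC scores are uncorrelated, the coefficient estimates no longer suffer from the inflation caused by multicollinearity among the original predictors.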
To validate the model, it is necessary to verify some assumptions, which can be done through an exploratory analysis of the residuals. The assumptions to be validated are as follows [3]:
• $E[\varepsilon_i] = 0$, that is, the mean of the errors must be zero;
• $\varepsilon_i \sim N(0, \sigma^2)$, that is, the errors must follow a normal distribution with constant variance;
• the errors are independent.

Another condition to be verified is the existence of multicollinearity, which can be identified through the correlation values and/or the Variance Inflation Factor (VIF). A VIF greater than 10 indicates multicollinearity effects in the data. The VIF is given by the expression

$\mathrm{VIF}_j = \dfrac{1}{1 - R_j^2},$

where $R_j^2$ is the coefficient of determination of $x_j$ regressed on the other predictor variables in the model [9].

Variable Selection Method. The stepwise method is used to obtain a model whose predictor variables better explain the response variable, and different criteria can be considered, for example the AIC. The "Backward" method builds the regression model with all the predictor variables and removes them according to the chosen criterion. The "Forward" method adds the predictor variables one by one until there are no more candidates that increase the regression sum of squares. It is also possible to build a regression model with both entry and elimination of predictor variables according to the chosen criterion, the so-called "Both" method. The iterative process ends when there are no more variables to be introduced or eliminated according to the adopted criterion [9].

One way to choose the model that best explains the data under study is the AIC value. This criterion compares the adequacy of the models by balancing the accuracy of the fit against the number of explanatory variables [2]. The AIC is calculated as

$\mathrm{AIC} = -2\ln(L_p) + 2p,$

where $L_p$ is the maximum value of the likelihood function for the model and p is the number of predictor variables present in the model. The models with the lowest AIC are the chosen ones [1].

For this analysis, 146 experimental agricultural tires and 31 predictor variables were used, 4 of them qualitative and 27 quantitative. Two response variables were used, $y_1$ and $y_2$, for Test 1 and Test 2, respectively. The variables were coded due to a confidentiality agreement. All computations were performed in the R software using the appropriate packages.

Figure 1 represents the correlation coefficients (color intensity and circle size are proportional to the correlation coefficients) and shows strong relationships with $y_1$, the response variable for Test 1, as well as with $y_2$, the response variable for Test 2. Taking into account the correlation values in Fig. 1, multicollinearity effects are expected, since the predictor variables are correlated with each other. It is also possible to see that $x_6$, $x_7$, $x_8$, $x_9$, $x_{12}$, $x_{13}$, $x_{18}$, $x_{19}$, $x_{23}$, $x_{25}$, $x_{27}$, $x_{32}$ and $y_1$ are correlated (r > 0.90), as are $x_5$, $x_{10}$, $x_{11}$, $x_{15}$ and $y_2$ (r > 0.90).

As referred to in Sect. 2.1, PCA can be used to reduce the dimensionality of a problem or to eliminate multicollinearity effects. In this study, it was first necessary to confirm that multicollinearity effects exist. For this purpose, the data were normalized, since the variables are measured on different scales. Table 1 presents the VIF values for 19 quantitative predictor variables, obtained by fitting an MLR for each test; the results are the same for Test 1 and Test 2.
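As an illustration of these diagnostics and of the selection procedure, a VIF check and an AIC-based stepwise search can be written in R roughly as follows; this is only a sketch, where `tires_std` is an assumed standardized data frame and `pcr_model` is the PC-based model from the previous sketch, and `car::vif` is one common implementation of the VIF.

```r
library(car)   # provides vif()

# VIF on a model fitted with the original standardized predictors (hypothetical 'tires_std')
full_model <- lm(y1 ~ ., data = tires_std)
vif(full_model)                        # values above 10 point to multicollinearity

# AIC-based stepwise selection on the PC-based model (see the previous sketch)
model_both     <- step(pcr_model, direction = "both", trace = FALSE)
model_backward <- step(pcr_model, direction = "backward", trace = FALSE)
AIC(model_both, model_backward)        # the model with the lowest AIC is retained
```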
The VIF values for the remaining predictor variables are below 20. Considering the results in Table 1, there are multicollinearity effects in the study, because most of the VIF values are higher than 10. Since the main objective is to build models that allow predictions for Test 1 and Test 2, the conditions of applicability of MLR models must be guaranteed. For this reason, we opted to use PCA to eliminate the effects of multicollinearity, and the 27 principal components were used in the models for Test 1 and Test 2 instead of the original variables.

The graph in Fig. 2 shows the biplot after varimax rotation for the first two principal components, the first explaining 52.5% and the second 15.6% of the total data variation. It can be seen in Fig. 2 that variables $x_5$, $x_{11}$ and $x_{15}$ have the greatest positive contribution to the second principal component. Variable $x_{14}$ has the greatest negative contribution to the first component, whereas the other variables have their greatest positive contribution to the first component. In this study, all 27 principal components were used because it was necessary to retain all the information, and for this reason it was difficult to interpret each principal component.

After determining the PCs, MLR models were built for each tire test. Two models were found using stepwise methods with the AIC criterion to select the model for Test 1 and Test 2. In Table 2, the model obtained with the "Both" technique has the lowest AIC for Test 1, so it was the selected model; for Test 2, the lowest AIC was obtained with "Backward", and this was the chosen model.

Figure 3 shows the set of graphs produced in R with the plot(model) function to validate the assumptions. The first graph, Residuals vs Fitted, indicates that the variance of the residuals is constant and that the residuals are independent, since there is no pattern or tendency. The second graph shows that the errors follow a normal distribution, since the values lie along the diagonal, except at the extremes, which can indicate the presence of outliers. The Kolmogorov-Smirnov test was used to confirm whether the errors follow a normal distribution, considering the hypotheses $H_0: \varepsilon_i \sim N(\mu, \sigma^2)$ versus $H_1: \varepsilon_i \nsim N(\mu, \sigma^2)$. For this test, the p-value of 0.615 indicates that the errors may follow the normal distribution at a significance level of α = 0.05. The last graph, Residuals vs Leverage, shows that there are no influential points.

From Fig. 4 it is possible to draw the same conclusions for Test 2. Looking at the Normal Q-Q plot, most of the values lie along the diagonal, except at the extremes, so there is no evidence to reject the null hypothesis. Regarding the Kolmogorov-Smirnov test, with a p-value of 0.966 the null hypothesis is not rejected, and the errors may follow the normal distribution at a significance level of α = 0.05.

When the extreme points in the Normal Q-Q plot deviate from the diagonal, this can indicate outliers. The graphs in Fig. 5 reveal five outliers for Test 1 and four for Test 2. All of them were analyzed individually to understand whether they result from a process problem or from human error, since most of the values are not automatically introduced into the company's programs. The entire analysis was repeated for both models after removing the outliers; using the same criteria, the results were not very different and outliers continued to appear. Since not all the possible variables were used in this study, and the values of the observations flagged as outliers do not appear to result from human error, it was decided to keep all the observations.
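The residual checks reported above can be reproduced with standard R tools; the sketch below assumes the fitted object `model_both` from the earlier example and only illustrates the procedure, not the authors' exact script.

```r
# Diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
par(mfrow = c(2, 2))
plot(model_both)

# Kolmogorov-Smirnov test of normality on the standardized residuals
res <- rstandard(model_both)
ks.test(res, "pnorm", mean = mean(res), sd = sd(res))
# A p-value above 0.05 gives no evidence against the normality assumption
```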
The main objective of this study was to predict the results of Test 1 and Test 2 based on the constructed models. For this reason, a web application was developed using Shiny. In the application it is possible to do two things: upload the dataset and make predictions based on the values of the variables. Before creating the application, it was important to define the necessary steps for its construction, which are represented in Fig. 6.

Programming Code. In order to obtain the application interface, programming code was developed. A Shiny application is divided into two parts, "ui" and "server". The "ui", known as the interface, defines what the web application will look like, while the "server" defines what the application will do; it is where the calculations for making the predictions for Test 1 and Test 2 are performed. Before starting to program, four Excel documents were added to be used at a later stage. Figure 7 shows the information related to the dataset under study, the coefficients for Test 1 and Test 2, and the loadings of each principal component.

Firstly, in the "ui", the menus "Upload Dataset" and "Prediction" were defined. Lines 24 and 25 are where the user chooses the file to load into the application. Regarding "Prediction", it specifies the quantitative variables, using "numericInput", and the qualitative variables, using "selectInput" (Fig. 8). In order to show the predictions for Test 1 and Test 2, a "Go" button was created in line 124, and the next line shows the table. The following lines define the colors of the application (Fig. 9).

The next step was to define the calculations needed to predict the values of Test 1 and Test 2. In the first place, since the data have different scales, the data were standardized and the values introduced for each variable were saved (Fig. 10). Thereafter, it was important to define which variables are quantitative in order to determine the principal components for the 27 variables. After that, using the quantitative variables and the loadings obtained before, the principal components were calculated (Fig. 11). Finally, the models for Test 1 and Test 2 were computed using the selected model coefficients of each test and the principal components obtained before. In Fig. 12, n1 and n4 represent the MLR for Test 1 and Test 2, respectively. The maximum is calculated using the expressions in n2 and n5. After this, a condition was created to verify whether a tire passes the test, represented by n3 and n6. With this information, lines 264, 265 and 266 were used to construct the table with the calculated results. The last line is used to run the application.

Application Interface. In "Upload Dataset" it is possible to filter the data according to what is necessary to predict the values of Test 1 and Test 2. Figure 13 gives an example, using a dataset created for an agricultural tire, to illustrate only this functionality. In this case there are five variables, and "Search" is an input for what we want to look for, for example the tire identification number. The data contain 15 different tires, of which 2 tires contain the identification number 370881 (lower left corner). Whoever wants to use the application for agricultural tires can filter by the tire identification number and its specification appears. This will be necessary for predicting Test 1 and Test 2.
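The prediction step described under "Programming Code" can be sketched in a simplified form as follows; this is not the authors' application code, and the pre-loaded objects named here are assumptions: `centers` and `scales` (training means and standard deviations), `loadings` (27 x 27 PCA loading matrix), `coefs1`/`coefs2` (named coefficient vectors of the selected models) and `max1`/`max2` (legal maxima for each test). The qualitative inputs are omitted for brevity.

```r
library(shiny)

vars <- paste0("x", 1:27)   # the 27 quantitative variables (hypothetical names)

predict_test <- function(coefs, pcs) {
  # coefs: "(Intercept)" followed by the PCs retained by the stepwise selection, matched by name
  unname(coefs["(Intercept)"] + sum(coefs[-1] * pcs[names(coefs)[-1]]))
}

ui <- fluidPage(
  lapply(vars, function(v) numericInput(v, v, value = 0)),  # one numericInput per variable
  actionButton("go", "Go"),
  tableOutput("result")
)

server <- function(input, output) {
  prediction <- eventReactive(input$go, {
    x_new <- sapply(vars, function(v) input[[v]])            # collect the inserted values
    z     <- (x_new - centers) / scales                      # standardize as in training
    pcs   <- setNames(as.numeric(z %*% loadings), colnames(loadings))  # PC scores
    y1    <- predict_test(coefs1, pcs)
    y2    <- predict_test(coefs2, pcs)
    data.frame(y1 = y1, y2 = y2,
               Result = ifelse(y1 <= max1 & y2 <= max2, "Passed", "Not passed"))
  })
  output$result <- renderTable(prediction())
}

shinyApp(ui = ui, server = server)
```

In this sketch, `eventReactive` ties the calculation to the "Go" button, mirroring the behavior described for the application.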
Making predictions was one of the aims of this study, and by using the developed application the results of Test 1 and Test 2 can be predicted before tire production (Fig. 14). The chosen models for Test 1 and Test 2 use principal components that are linear combinations of the initial variables, and for this reason it is essential to insert the 27 quantitative variables and the 4 qualitative variables. When a variable is quantitative, the user has to introduce its value, and when it is qualitative, the user has to select the intended level. To make predictions, all the variables have to be filled in, and the results appear when the "Go" button is clicked. The application gives the results for Test 1, $y_1$, and Test 2, $y_2$. In addition, the "Result" field (Fig. 14) indicates whether the tire passed the test. The production of agricultural tires must comply with legal norms, and both tests have a maximum that cannot be exceeded. When the result is greater than the maximum, the tire does not pass the test, the specification has to be modified, and "Not passed" appears in "Result". Otherwise, the agricultural tire passes the test and "Passed" appears in "Result".

The main goal of this work was to apply multivariate analysis to variables related to tire production and to identify their influence on the two tire tests. In the exploratory analysis, it was possible to identify strong correlations between the quantitative variables, including the response variables of each test. With the variance inflation factor, it was possible to identify multicollinearity between the quantitative variables, which could be a problem when applying linear regression. Principal component analysis was used to eliminate the multicollinearity effects while retaining as much information as possible for the models. For this reason, it was decided to use the 27 principal components, which made it difficult to interpret each principal component from the loading values. Multiple linear regression was used to identify the significant variables for improving agricultural tire production; this identification was also difficult because the 27 principal components and the qualitative variables were considered. One of the objectives of this study was to find a multiple linear regression model for each of the two tests. For the selection of variables, stepwise methods were used, and the choice of the model was made taking the AIC value into account.

After obtaining the models for the two tests, an application was developed in Shiny in order to determine, quickly and efficiently, the test results for future agricultural tires. By using the application, it is possible to reduce the quantity of materials and resources, with an increase in efficiency and profits, since the application can predict the performance of a tire before its production starts. In addition, reducing the industrialization time is also an advantage, because some specifications can be discarded before the production phase. It also helps to preserve the environment by reducing the scrapping of tires with poor performance.
Therefore, this application helps the users to select the best specification for the agricultural tire, generating more confidence in the specification to be used and enabling a reduction of errors by the research and development department.

References
Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach
Handbook of Regression Analysis
Applied Multivariate Statistical Analysis, 5th edn
Applied Linear Statistical Models, 5th edn
An explanation of the use of principal-components analysis to detect and correct for multicollinearity
Handling critical multicollinearity using parametric approach
Multivariate statistical data analysis - Principal Component Analysis (PCA)
Applied Regression Analysis: A Research Tool

Acknowledgments. This work has been supported by FCT - Fundação para a Ciência e Tecnologia within the R&D Units Project Scope: UIDB/00319/2020.