title: Wastewater Pipe Condition Rating Model Using K-Nearest Neighbors
authors: Betgeri, Sai Nethra; Vadyala, Shashank Reddy; Matthews, John C.; Madadi, Mahboubeh; Vladeanu, Greta
date: 2022-02-22
Risk-based assessment of pipe condition focuses on prioritizing the most critical assets by evaluating the risk of pipe failure. The goal of this paper is to classify a comprehensive pipe rating that is derived from a series of physical, external, and hydraulic pipe characteristics identified for the proposed methodology. The traditional manual method of assessing sewer structural condition is time-consuming. By building an automated process using K-Nearest Neighbors (K-NN), this study presents an effective technique to automate the identification of the pipe defect rating from pipe repair data. First, we performed the Shapiro-Wilk test on 1240 records with 12 variables from the Dept. of Engineering & Environmental Services, Shreveport, Louisiana (Phase 3) to determine which factors could be incorporated in the final rating. We then developed a K-NN model to classify the final rating from the statistically significant factors identified by the Shapiro-Wilk test. This classification identifies the wastewater pipes in the worst condition, which need to be replaced immediately. The comprehensive model is built according to industry-accepted guidelines for estimating overall condition. Finally, for validation purposes, the proposed model is applied to a small portion of a US wastewater collection system in Shreveport, Louisiana.
Keywords: Pipe rating, Shapiro-Wilk test, K-Nearest Neighbors (K-NN), Failure, Risk analysis
The underground pipeline system forms a significant part of the infrastructure in the United States because it includes thousands of miles of pipes.
Sanitary sewers collect wastewater from public and private users as part of wastewater infrastructure systems (Ariaratnam et al., 2001). Approximately 800,000 miles of public sewer pipes and 500,000 miles of private sewer laterals are in place (Lester & Farrar, 1979). By 2032, 56 million people are expected to use centralized treatment plants (O'Reilly et al., 1989). Water supply and sewer pipelines are a basic need for society, and their security and efficiency are of paramount importance to human health and economic development. However, a substantial portion of these vital systems is decades old and plagued with deficiencies and inefficiencies, as shown in Figure 1. Risk-based asset management identifies the most critical assets, and the most efficient course of action for them, by prioritizing the highest risk of failure based on parameters such as pipe diameter, material, age, and wall thickness (Aprajita, 2018). The four focus areas of a risk-based asset management program are (1) understanding the deterioration modes and mechanisms, (2) risk assessment and management, (3) condition assessment, and (4) asset renewal, i.e., repair/rehabilitation/replacement. With the traditional method, the number of failures reported to sewer management can grow rapidly, which makes the handling of pipe failures imperative. However, manually identifying and classifying those failures from pipe repair documents and extracting the related information is challenging, because the documents are either voice-generated text or written manually by inspectors reviewing closed-circuit television (CCTV) videos. This manual process can lead to many errors (Sai Nethra Betgeri, 2021). Different approaches use different input parameters to compute a structural and/or operational condition grade.
The purpose of such a condition assessment system is to prioritize assets for future interventions such as comprehensive inspections and renewal programs. A rating method typically indicates the pipe's condition on a 1-5 scale, with 1 being the best condition and 5 requiring immediate replacement (Angkasuwansiri & Sinha, 2015; Companies; Khazraeializadeh et al., 2014; McDonald & Zhao, 2020; Opila, 2011; Wirahadikusumah et al., 2001).
The degradation of wastewater pipes does not follow a predictable pattern and is influenced by various internal and external characteristics (Najafi & Kulandaivel, 2005). The most common causes of sewer pipe degradation can be classified into three groups: construction variables, external parameters, and other factors. Construction variables include sewer pipe diameter, pipe material, burial depth, pipe bedding, load transfer, pipe joint type and material, and sewer pipe connection. External parameters include surface loading, ground conditions, groundwater level, and soil type (Chughtai & Zayed, 2008; Yan & Vairavamoorthy, 2003). Finally, other factors include the type of waste carried, pipe age, sediment level, surcharge, and poor maintenance practices (Ennaouri & Fuamba, 2013).
This paper aims to automate the determination of a comprehensive pipe rating by developing a model that evaluates the overall condition of wastewater pipe segments. This is achieved by combining pipe characteristics, external characteristics, and hydraulic characteristics for use by the utility department. The proposed model integrates previous research efforts and suggests an innovative method that uses 12 factors, encompassing physical characteristics, external conditions, and structural, operational, and hydraulic factors, to assess the overall condition of the pipe.
We used the Shapiro-Wilk test for hypothesis testing to determine the final factors that can be incorporated in the final rating. In machine learning, normally distributed data is beneficial for model building because many data sets in nature display a bell-shaped curve when graphed. We finally developed a model with the 10 factors identified by the Shapiro-Wilk test. The main goal of the developed model is to prioritize the worst-condition assets for intervention planning by classifying the rating automatically in significantly less than 24 hours, whereas an inspector takes days or weeks to go through those pipes.
A total of 3100 pipe records covering approximately 285 km (935,703 ft) were provided. For this study, roughly 47 km (154,060 ft) of 200 mm (8 in.) diameter vitrified clay (VC) pipe, totaling 1240 pipe segments, was selected. Information such as pipe diameter, depth, and length is given in pipe segment reports (PDF format), and other pipe information such as age, corrosion, and seismic zone is given in MS Excel files from the Dept. of Engineering & Environmental Services, Shreveport, Louisiana Phase 3. The documents contained no information on loading, soil type, or repair history, so these ratings were defined based on extensive information found in the literature (Data; Level; maps; Projections; Systems; Traffic). The pipe section reports contain different sections, as presented in Table 3. Each section contains text input by the inspector. The flow diagram of the data cleaning process used to obtain the final data is shown in Figure 3. We then randomly selected 200 documents and manually checked them to verify that the same data was extracted by the Python program. The information extracted and retrieved by the program was compared with the results of the manual review.
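The spot check described above can be illustrated with a minimal Python sketch. It is not the authors' program: the report identifiers, the field names (`diameter_in`, `length_ft`), and the helper functions `sample_report_ids` and `find_mismatches` are hypothetical, and the two toy records stand in for the 200 manually reviewed documents.

```python
import random


def sample_report_ids(report_ids, sample_size=200, seed=0):
    """Draw a reproducible random sample of report IDs for manual review."""
    rng = random.Random(seed)
    ids = sorted(report_ids)
    return rng.sample(ids, min(sample_size, len(ids)))


def find_mismatches(extracted, manual, fields):
    """Return (report_id, field) pairs where the extracted value differs
    from the manually reviewed value."""
    mismatches = []
    for report_id, manual_row in manual.items():
        extracted_row = extracted.get(report_id, {})
        for field in fields:
            if extracted_row.get(field) != manual_row.get(field):
                mismatches.append((report_id, field))
    return mismatches


# Toy data (hypothetical): values produced by the extraction program ...
extracted = {
    "R001": {"diameter_in": 8, "length_ft": 120.5},
    "R002": {"diameter_in": 8, "length_ft": 98.0},
}
# ... and the values a reviewer read from the same reports.
manual = {
    "R001": {"diameter_in": 8, "length_ft": 120.5},
    "R002": {"diameter_in": 8, "length_ft": 89.0},
}

review_ids = sample_report_ids(manual.keys(), sample_size=200)
reviewed = {report_id: manual[report_id] for report_id in review_ids}
mismatches = find_mismatches(extracted, reviewed, ["diameter_in", "length_ft"])
print(mismatches)
```

A non-empty mismatch list points to the reports and fields where the extraction logic needs to be rechecked.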
Sample data extracted by the Python program into the .csv file is shown in Table 4. Data preprocessing is the step in which the data is transformed, or encoded, so that the machine can parse it easily. In this study, we kept records with relevant data by removing the 4.2% of records with inconsistent data and those with 10% to 20% missing information per pipe before further analysis. This step makes the training dataset cleaner, which helps improve the accuracy of the model. After all these analyses and verification of data, the final data collection included 3100 pipe segment reports, of which 1240 are considered for our analysis (Figure 4). In some reports, the "Total Length" field contains the "Length Surveyed" value. This may be due to human error, or the information may have been misread while being scanned from a handwritten form by the CCTV inspector. We eliminated 70 reports with such inconsistent values.
The comprehensive rating model incorporates the well-established industry-standard condition rating method, the Pipeline Assessment Certification Program (PACP) developed by NASSCO (2001). The PACP structural and operation and maintenance scores are added to the comprehensive rating, together with other physical characteristics, external characteristics, and hydraulic parameters, to determine the overall pipe condition score. The K-Nearest Neighbor model is used for this purpose. We did not consider the geographical location of the pipe in our model implementation. The comprehensive rating framework is shown in Figure 5. The factors are summarized in Table 5. The factors' attributes and assigned ratings for physical characteristics (PC), external characteristics (EC), and hydraulic characteristics (HC) are presented in Table 6, Table 7, and Table 8, respectively.
In the next step, we determine the factors needed for our model using the Shapiro-Wilk test. Given an ordered random sample $x_1 < x_2 < \dots < x_n$, the original Shapiro-Wilk test statistic (J. P. Royston, 1982; Royston, 1992) is defined as
$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
where $x_{(i)}$ is the $i$th smallest number in the sample, $\bar{x}$ is the sample mean, and the coefficients $a_i$ are given by
$$(a_1, \dots, a_n) = \frac{m^{T} V^{-1}}{C}$$
where $C$ is a vector norm, i.e., $C = \lVert V^{-1} m \rVert = (m^{T} V^{-1} V^{-1} m)^{1/2}$, the vector $m = (m_1, \dots, m_n)^{T}$ contains the expected values of the order statistics of a standard normal sample, and $V$ is the covariance matrix of those order statistics. $W$ is a number that ranges from 0 to 1. Small values of $W$ imply that normality is rejected, whereas a value of one indicates that the data is normal (J. Royston, 1982a, 1982b; J. P. Royston, 1982; Royston, 1992, 1995). A significance level of 0.05 is used.
The next step is to build our model using the K-Nearest Neighbor (K-NN) classifier (Peterson, 2009). K-NN classifies new data points based on their similarity to previously stored data points.
Input: all factors; K, the chosen number of neighbors
Output: mode of labels
Begin:
• Load the data.
• Initialize K to your chosen number of neighbors.
• For each testing record:
  o Calculate the distance between the testing record a (from the 25% testing data) and every record b of the 75% training data using the Euclidean distance (ED), as shown in Equation 2.
  o Add each distance and the index of the corresponding training record to an ordered collection.
  o Sort the ordered collection of distances and indices in ascending order by distance.
  o Pick the first K entries from the sorted collection.
  o Get the labels of the selected entries.
  o Return the mode of the K labels.
$$ED(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} \quad (2)$$
Here $a_i$ and $b_i$ are the values of the $i$th factor for the testing and training record, and $n$ is the number of factors.
All variables and their attributes are stored in a final spreadsheet. This final sheet is loaded into JMP software (Jones & Sall, 2011; Kenett & Zacks, 2021) to perform the Shapiro-Wilk test. The results of the Shapiro-Wilk test are presented in Table 9. We hypothesized that all the factors are normally distributed; however, the distribution was not normal for diameter and seismic zone, so the hypothesis was rejected for these two factors. The factors for which $p \geq 0.05$ are considered in our modeling.
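The K-NN steps listed above can be sketched in plain Python as follows. This is an illustrative sketch rather than the JMP workflow used in the study: the function names `euclidean_distance` and `knn_predict`, the two-factor toy records, and the toy value K = 3 are assumptions made for the example.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Euclidean distance between two records with the same number of factors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k):
    """Classify one testing record as the mode of its k nearest training labels."""
    # Distance from the query to every training record, paired with the
    # index of that training record, sorted in ascending order of distance.
    distances = sorted(
        (euclidean_distance(row, query), index)
        for index, row in enumerate(train_X)
    )
    # Labels of the k closest training records.
    nearest_labels = [train_y[index] for _, index in distances[:k]]
    # Mode of the k labels.
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy training data (hypothetical): two factor ratings per pipe and a rating label.
train_X = [(1, 1), (1, 2), (2, 1), (5, 5), (5, 4), (4, 5)]
train_y = [1, 1, 1, 5, 5, 5]

good_pipe = knn_predict(train_X, train_y, (1.5, 1.5), k=3)
bad_pipe = knn_predict(train_X, train_y, (4.5, 4.5), k=3)
print(good_pipe, bad_pipe)
```

Because the distance is computed on raw factor values, this sketch assumes that all factors are on a comparable scale, such as 1-5 ratings.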
We divided the data into 75% training and 25% validation data and repeated the process several times with different values of K to reduce errors and make accurate predictions. We finally chose K = 7. As K increases, the predictions become more stable and more accurate up to a certain point. Figure 7 shows the misclassification rate as a function of K for K up to 25 and K up to 30; in both graphs, the lowest error, 0.31290, is found at K = 7. We also checked other ranges of K, such as 20 and 35, and again found the lowest misclassification rate at K = 7. We therefore used K = 7 for better accuracy. Table 10 shows the count and misclassification rate for training data and testing data for different K values. Misclassification is slightly higher because there is less training data for comprehensive rating 1, and slightly less for comprehensive ratings 2 and 3, than for comprehensive rating 4. This can be reduced by training the model on a wider variety of data with different comprehensive ratings. In our scenario, we did not use the entire dataset because it contains more pipes with comprehensive ratings 3 and 4 than with other ratings. For the K-NN calculation, the Euclidean distance is used to find the distance between each testing record and the training records, as shown in Equation 2. Table 11 shows the confusion matrix of the validation data compared with the actual comprehensive ratings given by the inspector. We also applied the Analytic Hierarchy Process (AHP) and the Naïve Bayes Classifier (NBC) to the same data set and compared their results with the actual comprehensive ratings given by the inspector. Table 12 shows the confusion matrix of AHP, and Table 13 shows the confusion matrix of the Naïve Bayes Classifier. The overall accuracy achieved by all the models is shown in Table 14.
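As an illustration of how a confusion matrix such as those in Tables 11 to 13 reduces to overall accuracy, misclassification rate, and per-class precision, recall, and F1 score, the sketch below uses the standard definitions of these metrics. The function name `classification_metrics` and the 3 x 3 toy matrix are assumptions for the example; the numbers are not taken from the paper's tables.

```python
def classification_metrics(confusion):
    """Compute metrics from a square confusion matrix where
    confusion[i][j] is the number of records of actual class i
    that were predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    per_class = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                          # actual c, predicted other
        fp = sum(confusion[r][c] for r in range(n)) - tp     # predicted c, actual other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class.append({"precision": precision, "recall": recall, "f1": f1})
    accuracy = correct / total
    return {
        "accuracy": accuracy,
        "misclassification_rate": 1.0 - accuracy,
        "per_class": per_class,
    }


# Toy confusion matrix (hypothetical): rows are actual ratings, columns are predictions.
toy_confusion = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
metrics = classification_metrics(toy_confusion)
print(round(metrics["accuracy"], 4))
for rating, values in enumerate(metrics["per_class"], start=1):
    print(rating, {name: round(value, 4) for name, value in values.items()})
```

The misclassification rate is simply one minus the overall accuracy, so the two can be cross-checked against each other when comparing models.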
Table 15 shows the accuracy, precision, recall, and F1 score for all 5 predicted comprehensive ratings, and Equations 3 through 7 present the overall accuracy, precision, recall, and F1 score, respectively. In these equations, TP, FN, FP, and TN represent the number of true positives, false negatives, false positives, and true negatives, respectively. In summary, the K-NN classifier with the factors identified by the Shapiro-Wilk test is superior to the Naïve Bayes classifier and AHP for classifying defect ratings based on the 10 factors, and it reduces the manual effort of the inspector.
The proposed condition rating model assesses the overall state of degradation of the wastewater pipe by combining a series of physical characteristics, external characteristics, and hydraulic characteristics. The model considered 12 initial factors that contribute to sewer pipe degradation and finally incorporated 10 factors based on the Shapiro-Wilk test. Diameter and seismic zone were the same for all the data; hence their data were not normally distributed, which resulted in the rejection of those two factors from our model. A K-Nearest Neighbor (K-NN) model was used to find the pipe rating. To validate the model, the comprehensive ratings predicted by our model were compared with the actual comprehensive ratings, and the accuracy was 73.23%, which is satisfactory. We also compared the comprehensive ratings predicted by AHP and the Naïve Bayes classifier with the actual comprehensive ratings; their accuracies were 9.35% and 52.90%, respectively, which shows that the K-NN model is more accurate in predicting the comprehensive rating. The main limitation of the study was the data. All the pipes had the same diameter and seismic zone. Therefore, more pipe data from different geographic locations is needed to make the results more robust.
In addition, for future research, more experimental applications to case studies are suggested to refine and extend the structural, operational, and hydraulic factors used in the model by considering a wider variety of data. With more factors, this method could be applied to any wastewater pipes to recognize, in significantly less time and with far less manual effort, those in the worst condition that need to be replaced immediately. Secondly, integrating the consequence-of-failure model with comprehensive rating models can help utility managers prioritize critical investment needs more efficiently.
References:
Development of a robust wastewater pipe performance index
Guidelines for Implementing Risk-Based Asset Management Program to Effectively Manage Deterioration of Aging Drinking Water Pipelines, Valves and Hydrants. Virginia Tech
Assessment of infrastructure inspection needs using logistic models
Money Down the Drain 1. The Sewer Renewals Project of the Severn-Trent Water Authority
Infrastructure condition prediction models for sustainable sewer pipelines
USGS water data for the nation
Factors influencing the structural deterioration and collapse of rigid sewer pipes
The structural condition of rigid sewer pipes: a statistical investigation
New integrated condition-assessment model for combined storm-sewer systems
JMP statistical discovery software
Modern industrial statistics: With applications in R, MINITAB, and JMP
Comparative analysis of sewer physical condition grading protocols for the City of Edmonton
An Examination of the Defects Observed in Six Kilometres of Sewers
Groundwater levels for the nation
Seismic hazard maps and site-specific data
Condition assessment and rehabilitation of large sewers
Pipeline condition prediction using neural network models
Analysis of defects in 180km of pipe sewers in Southern Water Authority
Structural condition scoring of buried sewer pipes for risk-based decision making
Novel approach in pipe condition scoring
PACP Condition Grades
K-nearest neighbor
United States land cover projections
Algorithm AS 177: Expected normal order statistics (exact and approximate)
Algorithm AS 181: the W test for normality
An extension of Shapiro and Wilk's W test for normality to large samples
Approximating the Shapiro-Wilk W-test for non-normality
Remark AS R94: A remark on algorithm AS 181: The W-test for normality
Comparison of Sewer Conditions Ratings with Repair Recommendation Reports
Small wastewater systems research
US traffic volume data
Physics-Informed Neural Network Method for Solving One-Dimensional Advection Equation Using PyTorch
Prediction of the number of covid-19 confirmed cases based on k-means-lstm
Wastewater pipe condition rating model using multicriteria decision analysis
Consequence-of-failure model for risk-based asset management of wastewater pipes using
Challenging issues in modeling deterioration of combined sewers
Fuzzy approach for pipe condition assessment