key: cord-0058526-ib7jjbfw authors: Valier, Agostino title: The Cross Validation in Automated Valuation Models: A Proposal for Use date: 2020-08-26 journal: Computational Science and Its Applications - ICCSA 2020 DOI: 10.1007/978-3-030-58814-4_45 sha: 314679be7bbd1a25af5bf68d3c1d0bca641765a2 doc_id: 58526 cord_uid: ib7jjbfw

The appraisal of large amounts of properties is often entrusted to Automated Valuation Models (AVM). At one time, only econometric models were used for this purpose. More recently, machine learning models have also been used in mass appraisal techniques. The literature has devoted much attention to assessing the performance of these models. Verification tests first train a model on a training set, then measure the prediction error of the model on a set of data not seen before: the testing set. The prediction error is measured with an accuracy indicator. However, verification on the testing set alone may be insufficient to describe a model's performance. In addition, it may not detect model biases such as overfitting. This research proposes the use of cross validation to provide a more complete and effective evaluation of models. Ten-fold cross validation is used with 5 models (linear regression, regression tree, random forest, nearest neighbors, multilayer perceptron) in the assessment of 1,400 properties in the city of Turin. The results obtained during validation provide additional information for the evaluation of the models, information that the accuracy measurement alone cannot provide.

Mass appraisal models are techniques for the evaluation of large sets of real estate assets, carried out using common real estate data, a unique evaluation protocol and performance verification tests [1]. Mass appraisal models are automated calculation models, hence the acronym AVM (Automated Valuation Models).
Each model expresses the function that links real estate characteristics to their prices. A model can be defined in different ways: on the one hand, econometric models based on hedonic price theory; on the other, machine learning models that learn directly from the data [2]. Each model, once defined, must be verified. The verification phase consists of a test that provides an accuracy indicator measuring the model's error in price prediction. Accuracy tests have been very successful in scientific research: they have been used both for the evaluation of proposed models and for comparisons between models. The accuracy indicator is quick to compute and easy to understand, even for people who are not experts in writing algorithms. However, it is often insufficient to fully describe the nature and performance of a model. It therefore needs to be supplemented with other information that is equally understandable by non-experts.

All tests that measure the predictive capacity of mass appraisal models adopt the same protocol. First, the real estate dataset containing the characteristics of the properties X and their prices Y - sales prices or asking prices - is divided into two subsets. The model is trained on the training set, where it tries to find the relationship between the variables X and Y. Next, the model is tested on the remaining part of the dataset (the testing set), which contains data the model has not seen before. The actual prices (y) contained in the testing set are then compared with the predicted prices (ŷ) formulated by the model. The smaller the difference between ŷ and y, the more effective the model [3]. The accuracy measurements of mass appraisal models are widely used in the scientific literature.
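The protocol just described can be sketched in a few lines of scikit-learn code. This is a minimal illustration on synthetic data: the dataset, the model and the choice of R² as accuracy indicator are placeholders, not the paper's actual configuration.

```python
# Minimal sketch of the train/test verification protocol described above.
# Synthetic data stands in for the real estate dataset (X = characteristics, y = prices).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                                          # property characteristics
y = X @ np.array([3.0, 1.5, -2.0]) + rng.normal(scale=0.1, size=200)   # prices

# Split into training set and testing set (data unseen during training)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)  # training phase
y_hat = model.predict(X_test)                     # predicted prices ŷ on the testing set
accuracy = r2_score(y_test, y_hat)                # accuracy indicator: smaller |ŷ - y| gives a higher score
```

Any regressor and any accuracy metric (MAPE, RMSE, COD, etc.) can be slotted into the same two-step scheme.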
Many papers have as their main objective the comparison of different AVMs on the basis of accuracy parameters [4]. The introduction of machine learning models has increased the number and diffusion of these papers. The first patented Automated Valuation Models employed regression analysis. In fact, this type of analysis is more effective for inferential purposes than for prediction: the inferential purpose is the ability to explain the role that individual regressors (property features) play in the final price [5, 6]. Machine learning models, on the other hand, are more skilled in prediction. Their main weakness is their black-box character: they are not able to explain the relationship between the price of a property and its characteristics. The comparison between regression analysis and machine learning models is not only a comparison between different techniques; many authors understand it as a comparison between human intelligence and artificial intelligence performing the same activity: valuation [7-10].

In the field of mass appraisal, the first machine learning models to be used were artificial neural networks. The first to propose accuracy tests was Borst [11]. Afterwards, Do and Grudnitski [12] compared artificial neural networks with multiple regression by measuring their predictive capacity. These contributions triggered a debate in the scientific community, within which some refuted the superiority of machine learning models over traditional econometric models [13]. Nguyen and Cripps [14] are credited with studying the relationship between artificial neural networks and the size of their dataset. Artificial neural networks are still being tested successfully today [15, 16]. Gradually, other types of machine learning models have been tested in mass appraisal, among them k-nearest neighbors; such algorithms have rarely proved effective [17, 18].
In contrast, support vector machines have been shown to provide reliable estimates [19]. Genetic algorithms, too, have been tested successfully [20-22]. Ensemble learning models combine several individual models within a single metamodel, which offers better performance than any of its component models alone. This is the case of random forests, which aggregate several regression trees; their effectiveness in predicting real estate value is widely demonstrated [23-27]. The measurement of predictive effectiveness is in fact one of the main tools with which to assess the models that have recently entered the debate on real estate valuation. However, the excessive weight that research places on the results of these tests risks making accuracy the only evaluation parameter for mass appraisal models [28].

This research methodology, although very widespread, has a statistical limit. The subdivision into only two subsets can create asymmetry in the data distribution. The split between training set and testing set takes place randomly, and the division into only two groups does not ensure uniformity between the two sub-samples. Sampling asymmetries often generate overfitting: the model, over-trained on the training set data, is unable to predict effectively when given the testing set data. Some techniques can be used to prevent this phenomenon; the best known is cross-validation [29, 30].

This study uses as data the asking prices of 1,416 residential units for the year 2013. These prices were collected from sales listings. To supplement the information contained in the dataset, data from the Oict monitoring centre (Osservatorio Immobiliare Città di Torino) were used. This monitoring centre is managed by the City of Turin, the Politecnico di Torino and the local Chamber of Commerce.
It has collected real estate prices annually since 2000. The Oict divides the city into 40 microzones; the microzone and its average sales value for the year 2013 have been added as variables to the dataset. The dataset has 13 variables, described in Table 1 and Table 2. The missing values of the variables 'Locali', 'Bagni', 'Balconi', 'Box', 'Affacci', 'Cantina', 'Piano' and 'Piani edificio' have been replaced with the most frequent value of the corresponding variable. This replacement technique, called imputation, is frequently used to deal with missing values in large datasets: each missing value is replaced with the average value or the most frequent value of the variable to which it belongs. After the imputation process, the dataset consists of 1,416 properties, each described by 13 variables. As can be seen from Fig. 1, the correlations between variables are very weak.

The dataset of 1,416 prices is divided into two subsets: the training set (75% of the data) and the testing set (25%). Five models have been identified and trained: a traditional econometric model (linear regression) and four machine learning models (regression tree, random forest, k-nearest neighbors, multilayer perceptron). The models have been taken from the scikit-learn library. The GridSearchCV tool was used to select the hyperparameters: it performs an exhaustive search, evaluating the model performance for each combination in the list of values provided by the authors, until the optimal set is found. Here the grid search has been further strengthened by combining it with cross-validation. K-fold cross-validation is a statistical technique that subdivides the training dataset into k parts of equal size. At each step, the k-th part becomes the validation dataset, while the remaining parts constitute the training dataset.
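The subdivision into k equal parts can be illustrated with scikit-learn's KFold splitter. This is a toy sketch: a small index array stands in for the training set.

```python
# Minimal illustration of splitting a training set into k equal parts.
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(20)         # toy stand-in for the training set
kfold = KFold(n_splits=10)   # k = 10, as in this research

splits = list(kfold.split(data))
# Each of the 10 steps yields indices for the training part
# and for the held-out k-th part used as validation set
train_idx, valid_idx = splits[0]
```

GridSearchCV accepts the same scheme through its cv parameter (e.g. cv=10), which is how a grid search can be combined with k-fold cross-validation.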
Then k verifications are performed: each time, the model trained on the k−1 remaining parts is tested on the data of the k-th part withheld from the training set. The number k is at the discretion of the authors and generally takes the value 5 or 10; this research always uses ten-fold cross-validation. This gives the model verification steps even before the final verification on the testing set, reducing the possibility of overfitting. The combined use of GridSearchCV with cross-validation first identifies the best set of hyperparameters, then tests the algorithm (with the newly identified parameters) 10 times on the values of the training set (Table 3). The hyperparameters found by the GridSearchCV tool combined with cross-validation, and the corresponding validation results, are summarized in Table 4. The models were then tested on the testing set, recording the scores of Table 5.

Reading the results shows a clear superiority of the machine learning models over linear regression analysis: the econometric model - here represented by linear regression - records the lowest scores. However, this research is not limited to reading the testing scores. The analysis of the results is divided into two parts: in the first, only the validation scores are analyzed; in the second, the validation scores are related to the testing scores. The aim is to verify whether the models - beyond the differences in their internal structures - behave differently in predicting results. For this purpose, a single-factor ANOVA was carried out on the 50 results (5 models × 10 folds) obtained during the cross-validation phase (Table 6). The F value is greater than the critical F value; the hypothesis H0, according to which the mean values recorded by all models are equal, is therefore rejected. The p-value, lower than alpha, likewise confirms the rejection of H0.
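A single-factor ANOVA of this kind can be run with scipy.stats.f_oneway. The per-fold score vectors below are illustrative stand-ins, not the values behind Table 6.

```python
# One-way (single-factor) ANOVA on per-fold validation scores.
# The three score vectors are synthetic examples, not the paper's data.
from scipy.stats import f_oneway

scores_linreg = [0.70, 0.69, 0.71, 0.70, 0.68, 0.72, 0.70, 0.69, 0.71, 0.70]
scores_forest = [0.84, 0.86, 0.83, 0.85, 0.87, 0.84, 0.85, 0.86, 0.83, 0.85]
scores_knn    = [0.82, 0.83, 0.81, 0.84, 0.82, 0.83, 0.82, 0.81, 0.84, 0.83]

f_stat, p_value = f_oneway(scores_linreg, scores_forest, scores_knn)

alpha = 0.05
reject_h0 = p_value < alpha  # reject H0 (equal group means) when p < alpha
```

When F exceeds the critical F value (equivalently, when p falls below alpha), the hypothesis that all groups share the same mean score is rejected.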
In Table 7 the ANOVA analysis is repeated, this time only on the 4 machine learning models (regression tree, random forest, nearest neighbors and multilayer perceptron). In this case F remains below the critical F value, just as the p-value exceeds alpha; the hypothesis that the statistical behaviours of the various groups are similar is therefore accepted. These two ANOVA analyses show that the differences between traditional econometric models and machine learning models are not limited to the internal structures of the models: there are also differences in the results obtained and in their distribution.

The second part of the analysis correlates the results of cross-validation with the testing scores. The models that recorded 10 similar values in the cross-validation phase (and therefore a low variance) did not increase their performance once evaluated on the testing set; on the contrary, their scores decreased. The linear regression suffered a drastic drop in performance (from 0.70 to 0.63), while the decrease of k-nearest neighbors was more contained (from 0.83 to 0.82). The multilayer perceptron records variance values too high to be compared with the previous two: very high variance indicates quasi-random behavior of the model, which makes predictions unreliable. On the testing set, in fact, it did not suffer a drop in performance but an increase. The other two models (regression tree and random forest) have higher variance values, meaning that the values obtained in the cross-validation phase are distributed over a wider range. These models experienced an increase in predictive performance once tested on the testing set: their final accuracy results are higher than the average of the 10 cross-validation values.

The research uses cross-validation to provide additional information for forecasting accuracy tests, which are commonly summarized in the accuracy parameter alone.
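The comparison carried out here - the spread of the 10 validation scores against the final testing score - can be reproduced with cross_val_score. This sketch uses synthetic data and default hyperparameters; the two models merely exemplify the low-variance/high-variance contrast, they are not the paper's tuned models.

```python
# Relating validation-score variance to the final testing score.
# Synthetic data; model settings are illustrative defaults.
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 5))
y = 2.0 * X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=400)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

results = {}
for name, model in [("linreg", LinearRegression()),
                    ("forest", RandomForestRegressor(random_state=0))]:
    scores = cross_val_score(model, X_train, y_train, cv=10)  # 10 validation scores
    model.fit(X_train, y_train)
    results[name] = {
        "cv_mean": scores.mean(),          # average of the 10 validation scores
        "cv_var": scores.var(),            # their variance (the spread analyzed above)
        "test": model.score(X_test, y_test)  # final accuracy on the testing set
    }
```

Comparing cv_mean and cv_var with the test entry for each model reproduces the kind of reading made in this section.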
Five models (one econometric, the remaining four machine learning) were used to predict the value of 1,416 properties in the city of Turin. The research reached two conclusions.

The first conclusion was obtained through the ANOVA analysis of the cross-validation results, which shows a different behaviour between the econometric model and the machine learning models. If all 5 models are considered together, the hypothesis that their mean values are equal is rejected; the hypothesis is confirmed if only the 4 machine learning models are considered. Cross-validation thus highlights - also from a statistical point of view - the different way in which traditional models and artificial intelligence models act.

The second conclusion is that models whose validation scores have a low variance lose predictive efficacy once verified on the testing set: the average of the values obtained in the validation phase is higher than the final accuracy output. Vice versa, models with a higher variance of the validation scores can obtain testing scores higher than the average of their validation results. This does not hold for models whose variance is so high that their results behave quasi-randomly. The models tested in this research are not sufficient to generalize these conclusions; further studies will be needed to investigate the phenomenon.

References

- The impact of machine learning on economics
- Prediction accuracy in mass appraisal: a comparison of modern approaches
- Who performs better? AVMs vs hedonic models
- Does sustainability affect real estate market values? Empirical evidence from the office buildings market in Milan (Italy)
- New bottom-up approaches to enhance public real estate property
- Hedonic prices and implicit markets: product differentiation in pure competition
- Advances in Automated Valuation Modeling, SSDC
- Mass appraisal methods: an international perspective for property valuers
- A machine learning approach to big data regression analysis of real estate prices for inferential and predictive purposes
- Artificial neural networks: the next modelling/calibration technology for the assessment community
- A neural network approach to residential property appraisal
- An exploration of neural networks and its application to real estate valuation
- Predicting housing value: a comparison of multiple regression analysis and artificial neural networks
- Artificial neural networks for predicting real estate prices. Cuantitativos para la Economia y la Empresa
- Impact of artificial neural networks training algorithms on accurate prediction of property values
- Valuation analysis of commercial real estate using the nearest neighbors appraisal technique
- Real estate investment advising using machine learning
- The mass appraisal of the real estate by computational intelligence
- Using genetic algorithms for real estate appraisals
- Property valuations in times of crisis: artificial neural networks and evolutionary algorithms in comparison
- Using genetic algorithms in the housing market analysis
- Estimating the performance of random forest versus multiple regression for predicting prices of the apartments
- Machine learning: an applied econometric approach
- Big data in real estate? From manual appraisal to automated valuation
- Mass appraisal of residential apartments: an application of random forest for valuation and a CART-based approach for model diagnostics
- Predicting home value in California, United States via machine learning modeling
- Market value without a market: perspectives from transaction cost theory
- Identifying real estate opportunities using machine learning
- Predicting property price index using artificial intelligence techniques