key: cord-0188919-64qg0ci7
authors: Wang, Zimo; Wang, Yicheng; Wu, Sensen
title: House Price Valuation Model Based on Geographically Neural Network Weighted Regression: The Case Study of Shenzhen, China
date: 2022-02-09
journal: nan
DOI: nan
sha: 414fccdeb81b4c74fecec8265eb146b06dc6c1c5
doc_id: 188919
cord_uid: 64qg0ci7

Confronted with the spatial heterogeneity of the real estate market, traditional research has used Geographically Weighted Regression (GWR) to estimate house prices. However, its kernel function is non-linear and hard to interpret, its bandwidth is difficult to choose, and its predictive power leaves room for improvement. Consequently, a novel technique, Geographically Neural Network Weighted Regression (GNNWR), is applied to improve the accuracy of real estate appraisal with the help of neural networks. Based on the Shenzhen house price dataset, this work captures the weight distribution of different variables in the Shenzhen real estate market, which is difficult to obtain with GWR. Moreover, we focus on the performance of GNNWR, verify its robustness and superiority, refine the experimental process with 10-fold cross-validation, and extend its application area from natural to socioeconomic geospatial data. It is a practical and effective way to assess house prices, and we demonstrate the effectiveness of GNNWR on a complex socioeconomic dataset.

As one of the countries with the fastest urbanization, China has seen steadily rising housing prices in the past decades, especially in its major cities. Affected by COVID-19 in 2020, the world's major economies entered a liquidity easing cycle, and housing prices in many Chinese cities rose significantly. [4] In response, several Chinese cities, such as Shenzhen, Xi'an, and Chengdu, have implemented second-hand housing transaction reference prices, which are used to curb rising housing prices. The reference price provides a reasonable valuation of housing prices with only slight bubbles.

Housing price is closely related to the life of new urban residents, and it is also an economic index that the government needs to pay close attention to. Exploring the spatial distribution pattern of housing price has great practical significance and guiding value for government regulation, individual house purchase or third-party valuation.

To model and estimate house prices, scholars have developed a variety of models. In 1972, Rosen proposed the Hedonic model, which aims to measure property prices using a number of environmental factors. Early studies mainly considered three components: locational traits, structural traits, and neighborhood traits, i.e., residential prices are mainly a function of these three characteristics and are approximately linearly related in an exponentially corrected manner. [8, 27] A number of subsequent studies have demonstrated the relative validity of this model, and measures of these factors make it possible to estimate more accurately the positive or negative correlation between each independent variable and house prices. For example, Henry M. K. Mok's modeling of house prices in Hong Kong in 1994 showed that house prices were significantly negatively correlated with the age of the house and the distance from the CBD, and significantly positively correlated with the floor. [23] Over time, more and more independent variables were taken into account and more statistical indicators were added to test the validity of the model. Further studies have partially incorporated land use planning as well as accessibility in terms of transportation. [5] In recent years, residential house price studies have incorporated a variety of external environmental factors, such as natural landscape and neighborhood size, to analyze their impact on house prices. [13]

However, these models have constantly encountered problems in dealing with spatial heterogeneity and spatial non-stationarity, i.e., the same independent variable has different effects on house prices in different regions. Ordinary Hedonic models can only model a given independent variable with a constant coefficient, but the real situation is often influenced by spatial factors. For example, in suburban areas, transportation conditions dominate house prices and the quality of nearby schools does not matter. In downtown areas, however, the quality of schools near homes might be more critical and nearby transportation conditions are relatively less important. This is something that cannot be analyzed by ordinary Hedonic models.

Further, taking into account the spatial heterogeneity of the different influencing factors, Geographically Weighted Regression (GWR) methods were proposed that allow the coefficients to change at different locations. [7, 10] The method can be understood as a local weighted linear regression for each local area, and the weights take into account the effects of adjacent data points according to the first law of geography proposed by Tobler. [32] In order to build a more satisfying model of the relationship between house price and area, Brunsdon and Fotheringham, after proposing GWR, pointed out several key questions that GWR faces: the selection of the variables, the bandwidth, and the spatial autocorrelation of the error. [6] Many scholars have made attempts on this basis. For example, in 2011 Jijin Geng et al. used the GWR model to model house prices in Shenzhen. Compared with the Ordinary Linear Regression (OLR) model, the $R^2$ improved from 0.56 to 0.79. [12] Zhang et al. used mixed geographically weighted regression to model rent in Nanjing, i.e., some variables were locally weighted according to geographical location while others were globally weighted, and good results were achieved. [36] Binbin Lu added non-Euclidean distance to GWR; for geographic elements that do not obey the standard linear measure, this model achieves better results on the spatial proximity measure of London and can give better estimation performance for house prices. [19, 20]

However, the ability of GWR to express nonlinear spatial relationships is quite limited. Therefore, many scholars have resorted to artificial intelligence methods, which have developed rapidly in recent years, to model house prices using their strong ability to fit nonlinear relationships. [17, 29] Although the estimation performance of neural network models is usually superior to that of GWR models, the spatial distributions obtained by these models are not entirely reasonable and the constructed regression relationships are difficult to interpret spatially, because they ignore the spatial properties of residential price regression relationships.

In recent years, based on the idea of geographic weighting in GWR, Sensen Wu proposed the Geographically Neural Network Weighted Regression (GNNWR) model by combining OLR and neural network models. [9, 33] Thanks to the powerful learning ability of neural networks, the potential spatial non-stationarity and complex nonlinear features in regression relations can be handled well. GNNWR has performed well in modeling the ecological environment of nearshore seas [34] and has also shown superior explanatory power in the estimation of spatial PM2.5 concentrations in China. [37]

On February 8, 2021, the Shenzhen Real Estate and Urban Construction Development Research Center released reference prices for second-hand housing transactions for the city's 3,595 residential quarters. [3] Based on this dataset, a residential price valuation model can be developed that covers various factors such as property endogenous variables, subway, and school district conditions. This study attempts to build a residential price valuation model with the help of a relatively new tool, GNNWR, in order to deal with the spatial heterogeneity and spatial non-stationarity present in this data. [9]

In summary, this study aims to put the GNNWR model into practice in the socioeconomic field, establish a residential price valuation model based on the reference price data of second-hand housing transactions in Shenzhen, accurately fit the spatial heterogeneity and nonlinear relationships of multiple environmental factors, and thereby obtain a more accurate house pricing model than the GWR method, together with the spatial distribution of the influence coefficients of multiple factors. It can provide a reference for residential valuation, land auctions, and the reference prices of second-hand housing transactions in other cities.

Owing to population growth, favorable economic conditions, and a sound business environment, house prices in Shenzhen have also been rising. In the short term, spurred by the loose liquidity following COVID-19, the Shenzhen real estate market was quite prosperous in 2020. Investment in real estate development increased by 16.4% over the previous year, the residential construction area increased by 21%, and the sales area of commercial housing increased by 17.3%, which led to a further rise in house prices as a whole. According to the second-hand housing data of 70 large and medium-sized cities released by the National Bureau of Statistics of China in 2020, Shenzhen's real estate market rose by 14.1% throughout the year. As the only city in China with an increase of more than 10%, it ranked first in growth rate.

In order to suppress the excessive growth of house prices, in February 2021 the Shenzhen Real Estate and Urban Construction Development Research Center formed the reference price of second-hand housing transactions in 3,595 residential quarters, based on the government-recorded transaction prices of second-hand housing and the surrounding first-hand housing prices. [1] From the perspective of data modeling, the Shenzhen data were selected for this study mainly for the following reasons. First, the reference price of second-hand housing transactions has itself undergone considerable evaluation compared to other data. It averages out the differences between house types and floors, and also combines government-recorded transaction prices with surrounding first-hand housing prices, removing short-term heat and bubbles and reflecting a relatively accurate valuation of a property. Second, Shenzhen's urban development is more natural. There is no important political center, relics or slums affecting urban planning. Finally, the reference prices were introduced in a uniform batch, with a large amount of data and wide influence. Modeling the reference price of second-hand house transactions in Shenzhen can provide a reference for more cities to introduce similar measures. The total number of complete and effective initial data records obtained in this study is 2871, covering 2871 residential quarters in Shenzhen. The specific data source is https://shenzhen.qfang.com/. In this study, the data are divided into the following three categories according to function:

Geographically Weighted Regression (GWR):

In the classic ordinary linear regression (OLR) model, dependent variable and independent variables can be expressed by the regression equation:

where $\beta_0$ is the regression constant; $\beta_1, \dots, \beta_p$ are the regression coefficients; and $\varepsilon_i$ is the error term of sample $i$, with mean zero and constant variance $\sigma^2$. Moreover, the coefficients can be estimated as:

In fact, the regression coefficient calculated by OLR model is the best unbiased estimation of all sample points, which can be regarded as the average relationship in the whole study area. The spatial variation of this relationship can be regarded as different fluctuations of the "average relationship" caused by spatial non-stationarity.
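The global least-squares fit described above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the authors' implementation (the text states that OLR was built in ArcGIS Pro); the function name `fit_olr` and the toy coefficients are assumptions made for the example.

```python
import numpy as np

def fit_olr(X, y):
    """Ordinary least squares with an intercept column prepended to X."""
    X1 = np.column_stack([np.ones(len(X)), X])
    # lstsq solves min ||X1 @ beta - y||^2, i.e. beta = (X1^T X1)^{-1} X1^T y
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

# Synthetic, noise-free data (hypothetical coefficients 3.0, 1.5, -2.0)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1]
beta_hat = fit_olr(X, y)
```

The single vector `beta_hat` applies to every location, which is exactly the "average relationship" that the local models below relax.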

Based on the first law of geography, some scholars proposed the geographically weighted regression (GWR) model, which makes the regression coefficients local rather than global and weights adjacent points according to their distances within the regression framework. The GWR model defines spatial non-stationarity as [10, 35]:

Therefore, we can regard $w_0(u_i, v_i)$ as the non-stationarity weight of the regression constant $\beta_0$, and $w_k(u_i, v_i)$ as the non-stationarity weight of the regression coefficient $\beta_k$. Substituting the OLR estimate of $\beta_k$ into the above formula, the estimated value can be obtained as follows:

The estimator in matrix form can be expressed as:
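In the notation commonly used in the GNNWR literature, the relations described in this subsection are often written as shown next. This is a sketch under assumed notation: the symbols $x_{ik}$, $X$ and $y$, and the diagonal form of $W(u_i, v_i)$, are assumptions chosen to agree with the verbal definitions in the text, and the authors' exact displayed formulas may differ.

```latex
\begin{align}
  y_i &= \beta_0 + \sum_{k=1}^{p} \beta_k x_{ik} + \varepsilon_i
      && \text{(OLR)} \\
  \hat{\beta} &= (X^{\top} X)^{-1} X^{\top} y
      && \text{(OLR estimate)} \\
  y_i &= w_0(u_i, v_i)\,\beta_0
        + \sum_{k=1}^{p} w_k(u_i, v_i)\,\beta_k x_{ik} + \varepsilon_i
      && \text{(spatially weighted form)} \\
  \hat{y}_i &= w_0(u_i, v_i)\,\hat{\beta}_0
        + \sum_{k=1}^{p} w_k(u_i, v_i)\,\hat{\beta}_k x_{ik}
      && \text{(estimated value)} \\
  \hat{\beta}(u_i, v_i) &= W(u_i, v_i)\,(X^{\top} X)^{-1} X^{\top} y,
      \quad
      W(u_i, v_i) = \operatorname{diag}\bigl(w_0(u_i, v_i), \dots, w_p(u_i, v_i)\bigr)
      && \text{(matrix form)}
\end{align}
```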

In GWR model, the weight kernels usually use Gaussian, bi-square, tri-cube and exponential functions. These functions can relatively simply express the complex relationship between spatial proximity (i.e. spatial distance) and spatial non-stationarity (i.e. spatial weight).

It should be noted that there are different ways to select the function in the spatial weight matrix. Different selection methods directly affect the final modeling accuracy.

The Gaussian weighted function can be expressed as:

where $d^s_{ij}$ is the distance between points $i$ and $j$, and $b$ is the bandwidth, which controls how quickly the weight declines with $d^s_{ij}$. The bandwidth can be selected in different ways: for the fixed Gaussian weight function, the bandwidth is the same constant at every point in a given model; for the adaptive Gaussian weight function, the bandwidth differs at each point, and the distance to a nearby neighboring point is often taken as the bandwidth. In either case, the Gaussian weight function requires one input parameter, namely the distance range (fixed bandwidth) or the number of major adjacent features (adaptive bandwidth).

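The bi-square weighted function can be expressed as:

$$w_{ij} = \begin{cases} \left[1 - \left(d^s_{ij} / b_i\right)^2\right]^2, & d^s_{ij} \le b_i \\ 0, & \text{the others} \end{cases}$$

where $d^s_{ij}$ is the distance between two points and $b_i$ is the bandwidth. It is likewise divided into fixed and adaptive types as described above.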
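The two kernels and the adaptive bandwidth can be illustrated with a short NumPy sketch. It uses the standard locally weighted least-squares calibration of GWR at a single location; the function names, the choice of the $k$-th nearest neighbour distance as bandwidth, and the particular Gaussian form are assumptions for illustration, not the ArcGIS Pro implementation used in the paper.

```python
import numpy as np

def gaussian_kernel(d, b):
    # One common form; some texts use exp(-0.5 * (d / b) ** 2) instead.
    return np.exp(-(d / b) ** 2)

def bisquare_kernel(d, b):
    # Weight decays to exactly zero at and beyond the bandwidth b.
    w = (1.0 - (d / b) ** 2) ** 2
    return np.where(d < b, w, 0.0)

def adaptive_bandwidth(d, k):
    # Distance to the k-th nearest neighbour (d includes the point itself at 0).
    return np.sort(d)[k]

def gwr_local_fit(coords, X, y, i, k, kernel=bisquare_kernel):
    """Weighted least squares at location i: (X^T W X)^{-1} X^T W y."""
    d = np.linalg.norm(coords - coords[i], axis=1)
    w = kernel(d, adaptive_bandwidth(d, k))
    X1 = np.column_stack([np.ones(len(X)), X])
    XtW = X1.T * w                      # scales each sample's column by its weight
    return np.linalg.solve(XtW @ X1, XtW @ y)

# Synthetic check: a spatially constant relation y = 2 + 3x is recovered locally.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 1, size=(100, 2))
X = rng.normal(size=(100, 1))
y = 2.0 + 3.0 * X[:, 0]
beta_local = gwr_local_fit(coords, X, y, i=0, k=30)
```

With the bi-square kernel only the `k` nearest samples receive non-zero weight, which is why a small `k` combined with integer-valued predictors can make the local design matrix nearly singular.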

This model is built using adaptive functions, i.e., an input variable is needed to select the number of major neighboring elements, and the AICc criterion is used to determine which value is preferable. [11]

Geographically Neural Network Weighted Regression (GNNWR):

Similarly, based on the non-stationarity in the spatial relationship, GNNWR goes further than GWR and tries to describe more accurately the fluctuation of spatial non-stationarity in the regression relationship at different locations. The key step of GWR is the selection and construction of the spatial weight matrix function. On this basis, GNNWR seeks an appropriate spatial weight matrix function with the help of a neural network.

To accurately fit the complex relationship between spatial distance and spatial weight, GNNWR designs a spatial weighted neural network (SWNN) to achieve the neural network expression of weight kernel function. Specifically, SWNN takes the spatial distance between points as the input layer and the spatial weight matrix as the output layer, and selects the appropriate number of hidden layers according to the modeling needs. The spatial weight calculation of the points corresponds with:

where $W(u_i, v_i)$ is the spatial weight matrix:

That is, this matrix is the result of the function $W: \mathbb{R}^2 \to \mathbb{R}^{(1+p)\times(1+p)}$. SWNN further considers the existence of an intermediate variable

where $d_{ij}$ is the distance from point $i$ to sample point $j$. Thus, the GNNWR-based house price estimation model is shown in Figure 3.
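The data flow of SWNN can be sketched as a plain NumPy forward pass: distances to the training samples go in, one weight per regression coefficient comes out, and the prediction multiplies those weights by the OLR coefficients. This is an untrained, simplified sketch; Dropout and Batch Normalization are omitted, the layer sizes follow those reported later in the text, and all function names and the toy inputs are assumptions.

```python
import numpy as np

def init_layer(rng, n_in, n_out):
    # He-style initialisation: weights drawn with variance 2 / n_in
    W = rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_in, n_out))
    return W, np.zeros(n_out)

def prelu(z, a=0.25):
    # PReLU with a fixed slope a (in a trained network a is learned)
    return np.where(z > 0, z, a * z)

def swnn_forward(D, layers):
    """Map distance vectors D (m, n_train) to spatial weights (m, p + 1)."""
    h = D
    for W, b in layers[:-1]:
        h = prelu(h @ W + b)
    W, b = layers[-1]
    return h @ W + b

def gnnwr_predict(D, X, beta_olr, layers):
    """y_hat_i = sum_k w_k(u_i, v_i) * beta_k * x_ik, with x_i0 = 1."""
    weights = swnn_forward(D, layers)
    X1 = np.column_stack([np.ones(len(X)), X])
    return np.sum(weights * beta_olr * X1, axis=1)

rng = np.random.default_rng(0)
n_train, m, p = 50, 5, 3
sizes = [n_train, 512, 128, 64, 16, p + 1]
layers = [init_layer(rng, a, b) for a, b in zip(sizes[:-1], sizes[1:])]
D = rng.uniform(0, 10, size=(m, n_train))       # toy distances to training points
X = rng.normal(size=(m, p))                     # toy predictors
beta_olr = np.array([1.0, 0.5, -0.2, 0.3])      # hypothetical OLR coefficients
weights = swnn_forward(D, layers)
y_hat = gnnwr_predict(D, X, beta_olr, layers)
```

Keeping the OLR coefficients fixed and learning only the weights is what lets the output of the network be read as a map of spatial non-stationarity for each variable.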

Significance Test Statistics for Spatial Nonstationarity: To test whether the relationship has significant spatial non-stationarity, we use the residual sum of squares and its approximate distribution, as derived by Leung et al. [16] and Wu [33], for significance tests of the GNNWR and GWR modeling results.

Firstly, express the hat matrix of GNNWR as:

The statistic $F_1$ is obtained as:

The distribution of $F_1$ can be approximated by an F distribution, where $\delta_1^2/\delta_2$ is the numerator degrees of freedom and $n - p - 1$ is the denominator degrees of freedom. That is, given a significance level $\alpha$, if the inequality

holds, it can be determined that the regression relationship has significant spatial non-stationarity; otherwise the spatial non-stationarity is not significant.
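A numerical sketch of the $F_1$ statistic follows, assuming the form given by Leung et al.: the residual sum of squares of the local model divided by $\delta_1$, over the OLR residual sum of squares divided by $n - p - 1$, with $\delta_i = \mathrm{tr}\{[(I - S)^\top (I - S)]^i\}$ for hat matrix $S$. The function name and the sanity check are assumptions for illustration; the p-value lookup in the F distribution is left out.

```python
import numpy as np

def f1_statistic(S, y, rss_olr, p):
    """Return (F1, numerator df, denominator df) for a local model with hat matrix S."""
    n = len(y)
    R = np.eye(n) - S
    M = R.T @ R
    delta1 = np.trace(M)
    delta2 = np.trace(M @ M)
    rss_local = float(y @ M @ y)        # residual sum of squares of the local model
    f1 = (rss_local / delta1) / (rss_olr / (n - p - 1))
    return f1, delta1 ** 2 / delta2, n - p - 1

# Sanity check: if the "local" model is OLR itself, F1 must equal 1.
rng = np.random.default_rng(0)
n, p = 40, 2
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)
H = X1 @ np.linalg.solve(X1.T @ X1, X1.T)          # OLR hat matrix
rss_olr = float(y @ (np.eye(n) - H) @ y)
f1, df_num, df_den = f1_statistic(H, y, rss_olr, p)
```

A value of `f1` well below 1 indicates that the local model reduces the residuals by more than its extra flexibility alone would explain.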

Second, the significance of the spatial non-stationarity can also be checked for each independent variable one by one. The null hypothesis is that the weight of this independent variable is the same everywhere in space. The alternative hypothesis is that the weight of this independent variable is not the same everywhere in space. To this end, first define the variance of the weight of the $k$-th independent variable over the $n$ data points.

Define $e_k$ as an $n$-rank vector whose $(k+1)$-th element is 1 and whose other elements are 0. A square matrix of order $n$ with every element equal to 1 is also defined.

The statistic $F_2$ is obtained as:

The distribution of $F_2(k)$ can be approximated by an F distribution, where $\sigma^2$ is the mean square error, $\gamma_{1k}^2/\gamma_{2k}$ is the numerator degrees of freedom, and $\delta_1^2/\delta_2$ is the denominator degrees of freedom. That is, given a significance level $\alpha$, if the inequality

holds, the null hypothesis can be rejected and the variable $k$ is determined to have significant spatial non-stationarity; otherwise the spatial non-stationarity is not significant.

Indicators of Model Performance: The paper uses the following metrics to evaluate the performance of the model. First, the corrected Akaike information criterion (AICc) [10] is as follows:

The method is applicable to both GWR and GNNWR. In practice, the smaller the value, the better [11], and we use AICc to select the appropriate input parameters for the GWR model. Other measures of model performance include the coefficient of determination ($R^2$), root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). The definitions are as follows:

Among them, $\bar{y}$ is the average of the observed values; $\sigma^2$ is the mean square error of the model; and $p$ is the effective degrees of freedom of the model.
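The metrics named above can be computed as in the following sketch. The $R^2$, RMSE, MAE and MAPE definitions are the usual ones (MAPE is returned as a fraction, not a percentage); the AICc expression is the form commonly used for GWR and is an assumption here, as are the function names and the toy numbers.

```python
import numpy as np

def regression_metrics(y, y_hat):
    err = y - y_hat
    return {
        "R2": 1.0 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2),
        "RMSE": np.sqrt(np.mean(err ** 2)),
        "MAE": np.mean(np.abs(err)),
        "MAPE": np.mean(np.abs(err / y)),
    }

def aicc(y, y_hat, eff_params):
    """Assumed GWR-style AICc: n ln(sigma^2) + n ln(2 pi) + n (n + p) / (n - 2 - p)."""
    n = len(y)
    sigma2 = np.mean((y - y_hat) ** 2)
    return n * np.log(sigma2) + n * np.log(2 * np.pi) + n * (n + eff_params) / (n - 2 - eff_params)

# Toy values, not data from the study
y = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 390.0])
m = regression_metrics(y, y_hat)
```

For these toy values the errors are all of magnitude 10, so RMSE and MAE coincide while MAPE is dominated by the smallest observation.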

The model uses a traditional neural network, and the specific process is shown in Figure 4. Combining it with 10-fold cross-validation helps ensure the robustness and reliability of the algorithm. All layers of the spatially weighted neural network are fully connected, and the Dropout technique proposed by Srivastava et al. is applied to improve the generalization ability of the model. [30] Each hidden layer is combined with the Batch Normalization technique. Parameters are initialized with the method proposed by He et al. [15], and PReLU is used as the activation function.

In the training process of GNNWR, we use the RMSE as the loss function. We use the popular Adam optimizer, which achieved better results than the stochastic gradient descent used in previous GNNWR practice. When the loss on the validation set grows or remains constant beyond a certain number of iterations, the model is considered to be overfitted and training is stopped automatically. For a given set of hyperparameters, 10 models can be generated on a randomly selected 10-fold data set (a total of 2440 items, accounting for 85%), with 9 folds selected as the training set and the remaining fold as the validation set. Next, by summing the losses of these ten models, the total model loss corresponding to the hyperparameters can be obtained. Finally, the hyperparameters with the lowest mean loss are selected as the best, and the GNNWR model they generate is compared with the other two schemes.

After several trials, the hyperparameters are set as follows: the learning rate is $10^{-2.95} \approx 0.00112$, $\beta_1 = 0.8$, $\beta_2 = 0.999$, and the batch size is 128. The percentage of loss at the Dropout layer is 0.9, and the maximum number of epochs is 90,000. The results for several hidden-layer configurations are compared in Table 1. Note that the mean squared error here is calculated from normalized house prices, so its order of magnitude differs from that in the analysis below. The pre-experimental comparison showed that increasing the number of hidden layers greatly improved the fitting accuracy without significantly weakening generalization. Considering the number of neurons in the input and output layers, this article adopts a 6-layer neural network structure containing 1 input layer, 4 hidden layers with 512, 128, 64, and 16 neurons, and 1 output layer. The number of neurons in the input layer is the number of training samples, and the number of neurons in the output layer is the number of parameters of the linear regression model (the number of independent variables plus one).

To further reflect the optimization in the iterations, Figure 5 shows the change in test indicators of one fold during model training. After running more than 30,000 epochs, if the loss on the validation set is not improved after 9000 epochs, the neural network training will be terminated.
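The stopping rule described above can be sketched as a small pure-Python function. It reads the rule as "stop once more than `min_epochs` epochs have run and the best validation loss is at least `patience` epochs old"; this reading, the function name and the default values taken from the text are assumptions rather than the authors' code.

```python
def early_stopping_epoch(val_losses, min_epochs=30000, patience=9000):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch = loss, epoch      # new best validation loss
        if epoch > min_epochs and epoch - best_epoch >= patience:
            return epoch                        # no improvement for `patience` epochs
    return len(val_losses)                      # ran to the end without triggering

# Small-scale illustration with shortened limits
stop_plateau = early_stopping_epoch([5, 4, 3, 3, 3, 3, 3, 3], min_epochs=2, patience=3)
stop_improving = early_stopping_epoch([5, 4, 3, 2, 1], min_epochs=2, patience=3)
```

In the plateau case the best loss is reached at epoch 3 and training stops three epochs later; in the second case the loss keeps improving, so the full list is consumed.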

To fully demonstrate the superiority of GNNWR, three models, OLR, GWR, and GNNWR, are developed and compared on the Shenzhen house price dataset. Among them, GWR uses the golden-section search method to find the most suitable number of neighboring elements according to the AICc index.

Since the NSS and QAPS variables take small integer values, the design matrix used in GWR modeling shows multicollinearity when the number of neighboring elements is small. Therefore, the lower bound of the search is set to 100, i.e., at least 100 neighboring elements are involved in solving the local regression coefficients.

In a simple pre-experiment, 2440 records are used for modeling and the remaining 431 for testing; the comparison shows that bi-square significantly outperforms Gaussian among the two weighting kernel functions of GWR. The specific parameters are shown in Table 2. Therefore, all subsequent comparisons in this paper take bi-square as the weight kernel function of the GWR model.

In the experiments, we used OLR as a comparison. Both the OLR and GWR solutions are built in ArcGIS Pro. GNNWR is built with the TensorFlow 1.15.0 library under a Python 3.6.13 kernel. We apply these three modeling approaches with 10-fold cross-validation: each model is built on the training set, and the results on the validation set are used to calculate each metric, evaluate the generalization ability of the model, and exclude the influence of chance factors. Finally, the modeling result with the strongest generalization ability is selected for each of the three methods, and the predictive ability of the model is tested on the test set. Of the 2871 records, we extracted 431 (about 15%) as the test set, and the remaining 2440 records were divided equally into 10 folds for cross-validation, each containing 244 records (about 8.5%).
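The split described above (431 test records, then 10 folds of 244 records each) can be reproduced in outline as follows. The random seed and the function names are hypothetical; only the record counts come from the text.

```python
import numpy as np

def split_indices(n_total=2871, n_test=431, n_folds=10, seed=0):
    """Shuffle record indices, hold out a test set, and cut the rest into folds."""
    rng = np.random.default_rng(seed)          # seed is an arbitrary choice
    perm = rng.permutation(n_total)
    test_idx = perm[:n_test]
    folds = np.array_split(perm[n_test:], n_folds)
    return test_idx, folds

def cv_pairs(folds):
    """Yield (train_idx, val_idx): each fold is the validation set exactly once."""
    for k, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        yield train_idx, val_idx

test_idx, folds = split_indices()
pairs = list(cv_pairs(folds))
```

Each of the ten training sets then contains 2196 records, and the test set is never seen during model selection.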

The results of the correlation analysis and descriptive statistics of Shenzhen house prices and the respective variables are shown in Table 3. As can be seen from the table, ranked by the absolute values of the correlation coefficients in descending order, the variables are SD, MF, NSS, DSS, GR, AB, PR, QAPS, NPS. The variables positively correlated with house prices are MF, NSS, GR, PR, QAPS, and NPS, and those negatively correlated are SD, DSS, and AB. The average price of residential housing in Shenzhen sampled in this dataset is 62219.33 yuan/m². Ranging from a maximum of 132000 yuan/m² to a minimum of 16100 yuan/m², the values basically cover the current reference prices of second-hand housing transactions for all residential housing in Shenzhen. VIF measures the extent to which an independent variable is influenced by the other independent variables; since all VIF values are extremely close to 1, the degree of multicollinearity in the data is minimal.
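The VIF check mentioned above can be sketched as follows, using the standard definition $\mathrm{VIF}_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing variable $j$ on all the others. The function name and the synthetic data are assumptions; the study's own values are those reported in Table 3.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
X_indep = rng.normal(size=(500, 3))                      # unrelated predictors
noisy_copy = X_indep[:, 0] + 0.1 * rng.normal(size=500)  # nearly duplicates column 0
X_coll = np.column_stack([X_indep[:, 0], X_indep[:, 1], noisy_copy])
vif_indep = vif(X_indep)
vif_coll = vif(X_coll)
```

Unrelated predictors give values near 1, as reported for the Shenzhen variables, while the near-duplicate column drives the VIF of column 0 far above that.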

Ranked by coefficient of variation in descending order, the variables are QAPS, Price, GR, AB, PR, MF, NSS, NPS, SD, DSS. After normalization, QAPS is the variable with the greatest difference, while DSS varies the least among properties. In general, the price fluctuation is also relatively large, and the price difference between high-quality and low-quality properties is obvious, which more truly reflects the scarcity and non-renewable nature of land resources.

It should be noted that this study also conducted hypothesis testing for each variable in the global regression equation using the R language. For each variable, it is assumed that its coefficient is zero in the global regression equation, and a test statistic that follows the t-distribution when the hypothesis holds can be constructed. Correspondingly, the p-value can be calculated. The p-values of AB, MF, GR, SD, NSS, and DSS are recorded as 0 because they are less than the accuracy threshold of $2.2 \times 10^{-16}$. The p-value of NPS is also very small and highly significant. It can also be found that PR is not significant at the 0.01 significance level, and QAPS is not significant at the 0.05 or even the 0.1 level. If a global regression were used, these two independent variables should be excluded. However, two spatial statistical modeling methods, GNNWR and GWR, are used in this study, and the significance of each variable in these models can be retested with the help of the $F_2$ statistic described in this paper. According to the analysis of non-stationarity diagnostics in Section 3.3, both variables are highly significant when the coefficients of the linear model are allowed to vary with geographic coordinates. This demonstrates the superiority of the spatial statistical modeling approach from another angle.

The evaluation of the house price valuation model examines both the ability to fit on the training set and the ability to predict on the test set. We randomly divide the 2871 house price records into 2440 records, split into 10 folds for training and validation, and the remaining 431 records, which serve as the test set. We evaluated the model using metrics such as $R^2$, RMSE, MAE, MAPE, AICc and the Pearson correlation coefficient. For the 10-fold cross-validation, the validation sets are merged and the following results are obtained.

Clearly, these data confirm the superiority of the GNNWR model. The worst prediction comes from the OLS model, which has the lowest $R^2$ and the highest prediction error as measured by RMSE, MAE and MAPE. Because of the severe spatial non-stationarity, the OLS model has difficulty detecting the intrinsic relations and spatial fluctuations between house price and the independent variables. Compared with the GWR model, the RMSE of the GNNWR model declines by about 13.0% and its MAE by about 13.5%. Other indicators, such as $R^2$ and MAPE, also make the superiority of the GNNWR model clear. Additionally, the mean residual error of the GNNWR model is 62.2% lower than that of the GWR model, which means that the predictions of the GNNWR model are less biased than those of GWR on this dataset. In short, the GNNWR model shows a notable improvement in generalization ability.

Table 4: Indicators of GNNWR, GWR and OLS on Merged Validation Set

To be more specific, we can compare the indicators of the GWR and GNNWR models in each modeling run on the 10 training sets. These parameters reflect the fitting quality of the modeling process. In Table 5, train set 0 means that data set 0 is excluded and data sets 1, 2, ..., 9 are selected, and so on. It should be noted that the number after GWR refers to the most suitable number of neighboring elements selected based on the AICc value. Since the training sets differ slightly, the most appropriate number of neighboring elements is selected anew each time the GWR model is built.

For all 10 data sets, the GNNWR models outperform the GWR models regardless of whether AICc, RMSE, $R^2$ or the Pearson correlation coefficient is used as the criterion. The evident advantage in AICc reveals that the GNNWR model not only provides a better prediction of house prices, but also applies a more accurate spatial weight matrix without much additional complexity. In contrast, the GWR model faces an overfitting problem, which sharply reduces the accuracy of its predictions on the validation sets. To sum up, the GNNWR model, which produces a more capable kernel function than any of the GWR models, performs best in capturing the details of spatial heterogeneity, estimating spatial weights and predicting the dependent variable.

Furthermore, we can judge the generalization ability by predicting the test set. In this study, we compare the models with the best generalization ability. Both the GNNWR and the GWR models perform best when data set 4 is chosen as the validation set, and the other indicators are shown in Table 6.

Compared with the GWR model, the GNNWR model is clearly superior in predicting the test dataset. The MAE drops by 10.2% and the MAPE by 10.7%, improvements that are practically useful for real estate agencies seeking better estimates. The RMSE is reduced by 6.6%, and the $R^2$ and the Pearson correlation coefficient improve as well, while the mean error increases.

Based on the spatial heterogeneity diagnostic indicators discussed above, the results based on GNNWR can be analyzed in two parts. First, it is possible to identify whether the model results have significant spatial non-stationarity. For the ten-fold data, the prediction performance parameters of each GNNWR model on the validation set are shown in Table 7.

Using RMSE as the index, the best fitting model (model 4) and the worst fitting model (model 3) were selected for hypothesis testing. The hypothesis testing parameters, calculated following the previous derivation, are shown in Table 8.

From the $F_1$ value, the p-value under the null hypothesis can be computed from the F distribution. Notably, the hypothesis is rejected, indicating severe and significant spatial non-stationarity when modeling Shenzhen house prices.

Next, we analyze the significance of each independent variable. For every independent variable, the null hypothesis is that the coefficient of this variable is a constant. Note that this hypothesis includes the hypothesis that the coefficient of this variable is 0 everywhere. Therefore, a sufficiently small p-value of $F_2$ rejects both hypotheses. All of the details are shown in Table 9.

Table 9: F2 Hypothesis Test

Evidently, every independent variable has a significant influence on house price, and each influence varies considerably across the region. Hence all of them have significant spatial non-stationarity. Moreover, this simple comparison also hints that a better model may require a higher spatial non-stationarity estimate for the variables and a lower spatial non-stationarity estimate for the intercept.

The relative error rate of each prediction is calculated on the validation set and the test set and plotted as a scatter plot in Figure 6. The feature directions of the point cloud are found by principal component analysis (PCA) and plotted on the graph as black dashed lines. It should be noted that we use the same axis range when plotting the point clouds in order to make the comparison clearer; no more than 5% of the points fall outside this range and are not shown. It is not difficult to see that GWR and GNNWR have great superiority over the OLS model. The point set is densely distributed close to the y-axis, indicating that most of the locations that cannot be well predicted by OLS can be predicted more effectively and accurately by spatial statistical models. The idea of local linear regression can effectively reduce the prediction error. Comparing GWR with GNNWR, we find that the feature direction lies above y = x, i.e., GNNWR can reduce the prediction error of the GWR model at the same location by a certain proportion.

Further, we compare the two models using a Q-Q plot and a histogram, as shown in Figure 7, both plotted in MATLAB. Again, no more than 5% of the points fall outside the plotted range and are not shown. Ordering the relative error rates, it can be found that the relationship between the k-th values on the validation set is approximately $\delta_{GNNWR} + 0.0003$. The reference lines that represent the theoretical distribution deviate clearly from y = x, which confirms the superiority of the GNNWR model.
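The Q-Q comparison of ordered error rates can be sketched as follows. The sketch assumes both models are evaluated on the same records, so the two samples have equal size, and it fits an ordinary least-squares reference line through the paired order statistics. The function name and the data are hypothetical.

```python
def qq_reference_line(errors_x, errors_y):
    """Pair the k-th smallest values of two equally sized samples and
    fit the line y = slope * x + intercept by least squares."""
    if len(errors_x) != len(errors_y):
        raise ValueError("samples must have the same size")
    xs, ys = sorted(errors_x), sorted(errors_y)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Made-up error rates; the order of the inputs does not matter.
slope, intercept = qq_reference_line([0.3, 0.1, 0.2], [0.45, 0.25, 0.65])
```

A fitted slope above 1 with the GNNWR errors on the x-axis would indicate that the ordered GWR errors grow faster than the ordered GNNWR errors.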

A comparison of the histograms in Figure 8 gives equally clear results. On the validation set, with the horizontal axis of the histogram set to [0, 1] and a bin width of 0.09, 9 of the 11 bins with an error rate less than or equal to 9.9% contain more data from the GNNWR model. This trend is also evident in the test set: with the horizontal axis set to [0, 1] and a bin width of 0.15, five of the seven bins with an error rate less than or equal to 10.5% contain more data from the GNNWR model.

Besides, we can combine the validation-set predictions of both models, as in Figure 9. For each threshold, the number of predictions from each model with an error rate below that value is counted, and the ratio of the two counts is plotted as the blue line. The ratio of the number of GWR predictions to the number of GNNWR predictions with an error rate above a given value is plotted as the orange line. Among all data (two sets of predictions on 2440 records) with a relative error rate below 0.203, predictions from GNNWR are 1.34 times as numerous as those from GWR. In contrast, among all data with error rates above 0.37, predictions from GWR are as much as 1.62 times as numerous as those from GNNWR. In conclusion, GNNWR accounts for more of the high-precision predictions and GWR accounts for more of the high-error predictions.

Other literature also supports the conclusion that GWR can significantly reduce the prediction error compared to OLS models, indicating that spatial heterogeneity exists. In another study on Shenzhen house prices, the authors used the GWR model to increase the $R^2$ from 0.56 to 0.79. [12] Some simple AI models, such as decision tree models, can even predict worse than OLS if they are not designed properly. [31] In a separate study comparing the OLS model with multiple models, the best stepwise and tuned SVM model reduced the RMSE by 25%, the polynomial regression model reduced the RMSE by 8.3%, and even the best simple neural network selected from those with 1-3 hidden layers increased the RMSE by 66%. [26] Since the 1990s, scholars have been trying to use ordinary neural network models to predict house prices and compare them with OLS models. Some studies have demonstrated the superiority of the neural network approach, but others have found that there is no great need to use neural networks. Considering the 47% reduction in RMSE compared to OLS in this study, it appears that simply using complex functions trained by neural networks to approximate the training data set does not by itself improve the prediction accuracy, whereas a GWR-based framework captures information on the geographic distribution well. These results indicate that accurate estimation of spatial heterogeneity is essential.
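The count ratios behind the blue and orange lines of Figure 9 can be reproduced in outline with the sketch below. It assumes strict inequalities at the threshold and returns infinity when the denominator count is zero; both choices, the function names and the toy error rates are assumptions made for illustration.

```python
def ratio_below(errors_num, errors_den, threshold):
    """(# numerator errors below threshold) / (# denominator errors below threshold)."""
    n_num = sum(1 for e in errors_num if e < threshold)
    n_den = sum(1 for e in errors_den if e < threshold)
    return n_num / n_den if n_den else float("inf")

def ratio_above(errors_num, errors_den, threshold):
    """(# numerator errors above threshold) / (# denominator errors above threshold)."""
    n_num = sum(1 for e in errors_num if e > threshold)
    n_den = sum(1 for e in errors_den if e > threshold)
    return n_num / n_den if n_den else float("inf")

# Made-up relative error rates for the same four records.
err_gnnwr = [0.05, 0.10, 0.15, 0.40]
err_gwr = [0.08, 0.25, 0.45, 0.50]

blue = ratio_below(err_gnnwr, err_gwr, 0.2)     # GNNWR count / GWR count below 0.2
orange = ratio_above(err_gwr, err_gnnwr, 0.37)  # GWR count / GNNWR count above 0.37
```

Evaluating the two functions over a grid of thresholds would give curves analogous to the two lines described for Figure 9.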

The GNNWR model is based on the structure of linear regression, in which different weights are assigned to different variables according to the location of the property in order to capture spatial heterogeneity. For the ten-fold dataset obtained in this study, the weights of the independent variables at each prediction point can be output and visualized after merging the ten validation sets. This section focuses on the analysis of the significance of these weights.
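How location-specific weights enter a linear prediction can be shown with a small sketch. It assumes the commonly stated GNNWR form in which each global OLS coefficient is multiplied by a weight evaluated at the property's location; the neural network that produces those weights is not reproduced here, and all names and numbers are hypothetical.

```python
def gnnwr_style_predict(features, beta_ols, local_weights):
    """Prediction at one location: sum over k of w_k * beta_k * x_k, with x_0 = 1.

    features      -- independent variables at the location (without intercept)
    beta_ols      -- global OLS coefficients [beta_0, beta_1, ...]
    local_weights -- location-specific weights [w_0, w_1, ...] (assumed given)
    """
    terms = [1.0] + list(features)
    if not (len(terms) == len(beta_ols) == len(local_weights)):
        raise ValueError("features, coefficients and weights must align")
    return sum(w * b * x for w, b, x in zip(local_weights, beta_ols, terms))

# With all weights equal to 1 the prediction reduces to plain OLS.
ols_pred = gnnwr_style_predict([2.0, 3.0], [1.0, 0.5, -0.2], [1.0, 1.0, 1.0])
# A location where the second variable matters twice as much as on average.
local_pred = gnnwr_style_predict([2.0, 3.0], [1.0, 0.5, -0.2], [1.0, 1.0, 2.0])
```

Because the structure stays linear, the product of a weight and its OLS coefficient can be read as a local regression coefficient, which is what the weight maps discussed in this section visualize.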

Since the data are normalized before GNNWR training, the values in Table 10 can be compared directly. The mean values show that the independent variables affect house prices to different degrees; ranked by absolute value from highest to lowest, they are SD, MF, AB, NPS, DSS, GR, NSS, QASP and PR. After accounting for spatial heterogeneity, the effect of NPS and AB on house prices is larger than that estimated using the correlation coefficient, and the effect of NSS is smaller. However, the signs of the correlations are unchanged: MF, NPS, GR, NSS, QASP and PR remain positively correlated with house prices, and SD, AB and DSS remain negatively correlated. Ranked by standard deviation from highest to lowest, the weights are DSS, MF, SD, AB, NSS, NPS, GR, PR and QASP. That is, public transportation conditions represented by DSS have greater spatial heterogeneity and school district conditions represented by QASP have less, which is in line with intuition. It can be speculated that, since the value of quality educational resources is similar for residents in all parts of the city, the school district factor contributes to house prices with a more stable weight across Shenzhen. It can also be presumed that the distance to the subway station is not so important for residents living in the CBD or very close to subway entrances. However, in ordinary residential areas of the city and in less affluent suburbs, the distance to the subway station is quite important. In-depth analysis requires the specific distributions of the weights of each variable, as shown by Figure a-z. 
These figures are classified with the natural breaks method, with color ranges that differ between subplots, and the class boundaries around 0 are fine-tuned to distinguish positive from negative correlation. Due to their small standard deviations, PR and QASP are classified into only 6 levels, while all other variables are classified into 8 levels. Overall, the modeling results based on the ten different training sets are smooth, with few abrupt changes or outliers among geographically close points. They are quite consistent in the weight distributions they predict.
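The two rankings described above (by absolute mean weight and by standard deviation of the weights) can be computed as in the sketch below. The dictionary layout, the function name and the weight values are made up for illustration and do not reproduce Table 10.

```python
from statistics import mean, stdev

def rank_weights(weights_by_variable):
    """Rank variables by |mean weight| and by the standard deviation of
    their local weights, both from highest to lowest."""
    stats = {name: (mean(w), stdev(w)) for name, w in weights_by_variable.items()}
    by_effect = sorted(stats, key=lambda name: abs(stats[name][0]), reverse=True)
    by_spread = sorted(stats, key=lambda name: stats[name][1], reverse=True)
    return by_effect, by_spread

# Hypothetical local weights at three prediction points.
toy_weights = {
    "SD": [-0.9, -1.1, -1.0],
    "DSS": [-0.6, 0.2, -0.4],
    "QASP": [0.05, 0.06, 0.04],
}
by_effect, by_spread = rank_weights(toy_weights)
```

The first ranking reflects the average strength of each variable, while the second reflects how strongly its effect varies over space.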

The intercept term represents the inherent premium of a house after all the effects of the independent variables have been considered. The map shows that the reference prices for second-hand housing transactions introduced by the government give the highest inherent premium to the coast of Nanshan District, with Houhai as the core, and to the middle of Futian District, with the east shore of Xiangmi Lake as the core. Because premium locations are scarce, the market is likely to price in this information even more aggressively and to give higher premiums. In 2020, the highest residential transaction prices at these two sites, at $50,000-$70,000 per square meter, continued to set new records for housing prices in Shenzhen, while relatively marginal residences returned to $10,000-$20,000. At this level, the government's reference price succeeds in capturing the distribution of inherent premiums while narrowing the gap between them. The GNNWR model is similarly able to provide an accurate estimate of the premium inherent in each block based on the reference price.

The endogenous property variables used in this model are MF, AB, NPS, GR and PR, in descending order of influence. For the most part, an increase in AB has a restraining effect on house prices in Shenzhen. The negative correlation between house prices and AB is strongest in the coastal Nanshan District with Houhai as the core, the central Futian District with Xiangmi Lake's eastern shore as the core, and the southern Longhua District with Shenzhen North Station as the core. This may be because a large supply of high-quality new houses is available near these locations, leaving relatively old properties vulnerable to a cold market. At the border of Luohu and Futian districts, the effect of AB on house prices shifts from negative to a rare positive correlation, i.e. properties here are not discounted for old age and may instead carry a premium. According to the research of Goodman et al., the process by which house age affects house prices is nonlinear, with a positive effect on house prices when the age of the house is greater than a certain threshold. [14] In fact, the area identified by the GNNWR model is exactly where the earliest construction in Shenzhen took place, and the famous landmarks Dongmen Old Street and Diwang Building are located nearby. An increase in GR can raise the house price, especially in central Futian District and central Luohu District, which lie in the prosperous part of the city with higher demand for GR. In the suburbs and along the coast, GR has a smaller effect on raising house prices, and there is even a weakly negative correlation zone in Longgang District. The fluctuation of PR is relatively small, and judging by the average weight its impact on the house price is not significant. However, there is one area with a clear positive correlation between PR and house price: the western part of Luohu District. This is contrary to the general perception, and we believe that it is mainly because the plot ratio there is closely related to the overall appearance of the neighborhood. The western part of Luohu District is the older urban area of Shenzhen, where a low plot ratio tends to indicate an old and dilapidated neighborhood, while a high plot ratio tends to correspond to new high-rise housing. Probably for this reason, a positive correlation area appears here but not in other locations.

The environmental variables considered in this model include SD, DSS, NSS, QASP.

There is a very obvious negative correlation between SD and house prices, and the most typical areas include most of Nanshan District and most of Luohu District. In comparison, the negative impact of SD is slightly smaller in suburban and inland areas, including Pingshan District and Guangming District. We speculate that suburban areas farther from the sea have other natural landscapes, such as lakes and forests, which partially compensate for the disadvantage of being farther from the sea. Also, SD is already quite large there, so the absolute value of the coefficient need not be large to fully reflect the weakening effect on house prices. It should be noted in particular that certain inland parcels also show a prominent negative SD correlation, such as the southern part of Longgang District and the northern edge of Luohu District. We believe that this is because SD there actually characterizes the distance from the core urban area, which produces a strong negative correlation.

The correlation between DSS and house price fluctuates considerably. Generally, being farther away from the subway means a lower house price. The regions showing negative correlation include most of the suburbs, especially the southern part of Baoan District and the southern part of Longhua District. In the main urban area, the western part of Luohu District, the central part of Futian District, and the central part of Nanshan District show a significant negative correlation between house price and DSS.

However, DSS is positively correlated with house price in the southeastern part of Futian District, the southern part of Luohu District, and the southern part of Nanshan District. On inspection, a large number of jobs are concentrated in these areas. The southeastern part of Futian District corresponds to the Huaqiang North Market, one of the biggest cell phone parts distribution markets in the world. The southern part of Luohu District corresponds to Xinxiu Village Industrial Zone. The southern part of Nanshan District corresponds to the area around Shekou Industrial Zone.

From this we draw the following inference. On the one hand, in residential and suburban areas, the closer a property is to a subway entrance, the more convenient commuting becomes, and the price of housing naturally rises to some extent. On the other hand, directly above a subway entrance, heavy pedestrian traffic and underground vibration from the subway may have a negative impact on the price of housing. Moreover, in areas with dense subway entrances, such as the CBD or industrial areas, being too close to a subway entrance may have a negative impact on house prices. People who buy houses in these neighborhoods are already close to their workplaces, so their need to commute by subway is small; being too close to a subway entrance instead aggravates noise and congestion.

The relationship between NSS and house price is that the more subway entrances there are, the higher the house price will be. Specifically for each district, we can find that the distribution of positive and negative correlations is almost opposite to that of DSS in Nanshan District, Futian District and Luohu District. This result confirms our above conjecture.

Surprisingly, the relationship between QASP and house prices is relatively weak. Except for the western part of Luohu District, where house prices are strongly positively correlated with QASP, the effect of QASP on house prices is not significant in the other districts of Shenzhen. We believe this is because the western part of Luohu District is the old city of Shenzhen, where many old residential areas market themselves by highlighting their mature school districts, resulting in a strong positive correlation. In contrast, the high-quality new housing in the new districts does not have established school districts, so other factors dominate and the impact of the school district is relatively weaker.

In this study based on Shenzhen house price data, we have ample evidence of the superiority of the GNNWR model over the OLS and GWR models. We use a ten-fold validation approach, and the following results are obtained by predicting one-tenth of the data each time. Using RMSE as the standard, GNNWR improved by 13% compared to GWR and by 47% compared to OLS. On all other indicators, the GNNWR model also shows significant improvement over both GWR and OLS. Second, we performed sufficient tests to demonstrate the robustness of the GNNWR model. The ten-fold validation mechanism reduces random interference, and the tests performed on the test set further demonstrate that the model is valid. In the section on hypothesis testing, we analyzed the significance of spatial heterogeneity. Judged by the AICc of the training set, the gain in fitting accuracy of GNNWR over GWR, which also models spatial heterogeneity, far outweighs the increase in the complexity of the spatial weight matrix. All of these analyses show that GNNWR has good robustness. Finally, we also analyze the spatial heterogeneity explored by GNNWR, which corroborates the outstanding information mining ability of the GNNWR model in the context of Shenzhen.
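The ten-fold procedure and the RMSE comparison summarized above can be sketched as follows. The shuffling seed, the function names and the fold construction are assumptions; the study does not state how its folds were drawn. The record count of 2440 is used only as an example size.

```python
import math
import random

def k_fold_indices(n_records, k=10, seed=0):
    """Shuffle record indices and split them into k disjoint validation folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rmse_reduction(rmse_model, rmse_baseline):
    """Relative RMSE reduction of a model against a baseline (0.47 means 47%)."""
    return 1.0 - rmse_model / rmse_baseline

folds = k_fold_indices(2440, k=10)
# Each fold serves once as the validation set; the other nine are used for training.
example_reduction = rmse_reduction(0.53, 1.0)  # illustrative values only
```

Merging the predictions made on the ten validation folds yields exactly one out-of-sample prediction per record, which is the basis for the comparisons reported in this section.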

This study focuses on the following innovations. First, in GWR, a relatively traditional modeling method for spatial analysis, the commonly used kernel functions are limited to two choices, bi-square and Gaussian. Therefore, the calculated spatial weight matrix often does not adequately reflect the characteristics of the dataset, which is the original motivation for proposing GNNWR. Second, other current studies that model house prices with neural networks rarely introduce a ten-fold validation mechanism. This is a serious problem, and this study refines the experiment according to more mature experimental practice for neural networks. Third, as some scholars have suggested, the "black box" nature of neural networks has significantly limited their practical significance in predicting house prices. [21] Both polynomial regression models and traditional neural network methods depart from the linear structure and have relatively complex expressions, making analysis and prediction much more difficult. Fourth, neural network prediction methods that take little geographical location information into account have unstable performance. Some studies highlighted the accuracy of neural network prediction compared to OLS models, [18, 24, 25] but others concluded that neural network models often fail to significantly outperform the OLS model and its improved variants (including hedonic models that correct the dependent variable with a log transform, and polynomial regression models). [22, 28] In any case, RMSE-based metrics show that few neural network models achieve more than a 30% improvement over OLS. Since the GNNWR model was proposed, there has been no relevant applied research in the socioeconomic field, and this study fills this gap.

Future improvements can be made in the following directions. First, the error term in the linear model can be further tested for linearity, homoscedasticity, independence and normality. If these properties are not satisfied, the dependent variable can be transformed using the Box-Cox method. Second, more independent variables can be obtained to expand the choice of independent variables. Third, the independent variables can be preprocessed and filtered. For example, if three independent variables, total number of buildings, total number of apartments, and total number of units, are obtained for a certain residential quarter, these independent variables will be multicollinear. Using PCA, the principal components of these independent variables can be extracted and the multicollinearity reduced. Another example is to use more data, such as enrollment rate and distance to school, to evaluate a school district. The independent variables can also be filtered using the forward, backward or stepwise method. Fourth, comparable tests can be performed on other data sets, or data from multiple cities can be collected to build a house price prediction benchmark. Fifth, time series analysis can be added so that the GNNWR model can predict in both time and space.
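As an illustration of the Box-Cox pretreatment suggested above, the sketch below applies the standard transform for a given parameter. The function name and the example prices are hypothetical, and choosing the parameter (for example by maximum likelihood) is left out of the sketch.

```python
import math

def box_cox(values, lam):
    """Box-Cox transform: (y**lam - 1) / lam for lam != 0, and log(y) for lam == 0.
    All values must be strictly positive."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

# Made-up house prices; lam = 0 gives the log transform used in many hedonic models.
log_prices = box_cox([10000.0, 20000.0, 70000.0], 0)
sqrt_like = box_cox([4.0, 9.0], 0.5)
```

The transform is monotonic, so the ordering of prices is preserved while the skewness of the dependent variable can be reduced before the residual assumptions are tested again.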

Analysis of spatial autocorrelation in house prices

Some notes on parametric significance tests for geographically weighted regression

Geographically weighted regression: a method for exploring spatial nonstationarity

The specification of hedonic indexes for urban housing

Geographically neural network weighted regression for the accurate estimation of spatial non-stationarity

Geographically weighted regression: the analysis of spatially varying relationships

Measuring spatial variations in relationships with geographically weighted regression. In Recent developments in spatial analysis

Geographically weighted regression model (GWR) based spatial analysis of house price in Shenzhen

A hedonic urban land price index

Age-related heteroskedasticity in hedonic house price equations

Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification

Statistical tests for spatial nonstationarity based on the geographically weighted regression model

House price prediction: hedonic price model vs. artificial neural network

Effectiveness comparison of the residential property mass appraisal methodologies in the USA

Geographically weighted regression using a non-Euclidean distance metric with a study on London house price data

Geographically weighted regression with a non-Euclidean distance metric: a case study using hedonic house price data

The potential of artificial neural networks in mass appraisal: the case revisited

Neural networks: the prediction of residential values

A hedonic price model for private properties in Hong Kong

Predicting housing value: A comparison of multiple regression analysis and artificial neural networks

Neural network hedonic pricing models in mass real estate appraisal

Housing price prediction using machine learning algorithms: The case of Melbourne city, Australia

Hedonic prices and implicit markets: product differentiation in pure competition

Application of artificial neural networks to the valuation of residential property. In Third Annual Pacific-Rim Real Estate Society Conference

Determinants of house prices in Turkey: Hedonic regression versus artificial neural network

Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research

House price prediction modeling using machine learning

A computer movie simulating urban growth in the Detroit region

The theory and method of geographically and temporally neural network weighted regression

Modeling spatially anisotropic nonstationary processes in coastal environments based on a directional geographically neural network weighted regression

GNNWR: an effective method for analyzing and predicting spatial nonstationarity by combining deep neural networks and ordinary least squares

Exploring housing rent by mixed geographically weighted regression: A case study in Nanjing