Submitted 13 May 2020 Accepted 18 June 2020 Published 13 July 2020 Corresponding author Thomas R. Etherington, ethering- tont@landcareresearch.co.nz Academic editor Tianfeng Chai Additional Information and Declarations can be found on page 13 DOI 10.7717/peerj-cs.282 Copyright 2020 Etherington Distributed under Creative Commons CC-BY 4.0 OPEN ACCESS Discrete natural neighbour interpolation with uncertainty using cross-validation error-distance fields Thomas R. Etherington Manaaki Whenua—Landcare Research, Lincoln, New Zealand ABSTRACT Interpolation techniques provide a method to convert point data of a geographic phenomenon into a continuous field estimate of that phenomenon, and have become a fundamental geocomputational technique of spatial and geographical analysts. Natural neighbour interpolation is one method of interpolation that has several useful proper- ties: it is an exact interpolator, it creates a smooth surface free of any discontinuities, it is a local method, is spatially adaptive, requires no statistical assumptions, can be applied to small datasets, and is parameter free. However, as with any interpolation method, there will be uncertainty in how well the interpolated field values reflect actual phenomenon values. Using a method based on natural neighbour distance based rates of error calculated for data points via cross-validation, a cross-validation error-distance field can be produced to associate uncertainty with the interpolation. Virtual geography experiments demonstrate that given an appropriate number of data points and spatial-autocorrelation of the phenomenon being interpolated, the natural neighbour interpolation and cross-validation error-distance fields provide reliable estimates of value and error within the convex hull of the data points. While this method does not replace the need for analysts to use sound judgement in their interpolations, for those researchers for whom natural neighbour interpolation is the best interpolation option the method presented provides a way to assess the uncertainty associated with natural neighbour interpolations. Subjects Computational Science, Spatial and Geographic Information Science Keywords Convex hull, Digital, Neighbor, Python, Raster, Sibson, Virtual geography experi- ments, Voronoi diagram INTRODUCTION Spatially continuous geographic phenomena are often only measured at point locations. Interpolation techniques provide a method to convert such point data into a continuous estimate of the phenomenon, and have become a fundamental computational technique of spatial and geographical analysts with key texts devoting large sections to interpolation methods (Burrough & McDonnell, 1998; O’Sullivan & Unwin, 2010; Slocum et al., 2014). Natural neighbour (or Sibson) interpolation is an interpolation technique that was first presented by Sibson (1981). The method is based upon a Voronoi (or: Dirichlet, Thiessen) diagram that partitions space to identify those areas that are closest to a set of points (Okabe et al., 2000). Previous authors (Sambridge, Braun & McQueen, 1995; Watson, 1999) have noted several useful properties of natural neighbour interpolation: (i) the method is an How to cite this article Etherington TR. 2020. Discrete natural neighbour interpolation with uncertainty using cross-validation error- distance fields. PeerJ Comput. Sci. 6:e282 http://doi.org/10.7717/peerj-cs.282 https://peerj.com/computer-science mailto:etheringtont@landcareresearch.co.nz mailto:etheringtont@landcareresearch.co.nz https://peerj.com/academic-boards/editors/ https://peerj.com/academic-boards/editors/ http://dx.doi.org/10.7717/peerj-cs.282 http://creativecommons.org/licenses/by/4.0/ http://creativecommons.org/licenses/by/4.0/ http://doi.org/10.7717/peerj-cs.282 exact interpolator, in that the original data values are retained at the reference data points; (ii) the method creates a smooth surface free from any discontinuities; (iii) the method is entirely local, as it is based on a minimal subset of data locations that excludes locations that, while close, are more distant than another location in a similar direction; and (iv) the method is spatially adaptive, automatically adapting to local variation in data density or spatial arrangement. To this list I would add: (v) there is no requirement to make statistical assumptions; (vi) the method can be applied to very small datasets as it is not statistically based; and (vii) the method is parameter free, so no input parameters that will affect the success of the interpolation need to be specified. These properties make natural neighbour interpolation particularly well suited for the interpolation of continuous geographic phenomena from data points that have a highly irregular spatial distribution. While the choice of an appropriate interpolation method will always vary on a case by case basis, studies comparing interpolation methodologies with climate and land surface data demonstrate that natural neighbour interpolation is a highly competitive and sometimes optimal technique (Abramov & McEwan, 2004; Bater & Coops, 2009; Hofstra et al., 2008; Lyra et al., 2018; Yilmaz, 2007). Unfortunately, natural neighbour interpolation can be relatively slow in comparison to other methods (Abramov & McEwan, 2004). The high computational cost arises from the need to insert a new point into the Voronoi diagram for every cell that will make up the interpolation field, and this geometric process becomes increasingly difficult in higher dimensions (Park et al., 2006). This has led to the development of discrete (or digital) natural neighbour interpolation that is significantly quicker than traditional approaches (Park et al., 2006) and has been applied successfully in a geographical context (Keller et al., 2015). While natural neighbour interpolation has various useful properties, and the discrete form is computationally scalable, there is a great deal of uncertainty associated with any interpolation. Therefore, being able to associate interpolation estimates with some form of uncertainty would be highly desirable. Previous efforts for natural neighbour interpolation have been based upon fitting statistical uncertainty models (Bater & Coops, 2009; Ghosh, Gelfrand & Mlhave, 2012), but this approach is contrary to natural neighbour interpolation’s useful properties (v), (vi), and (vii). Therefore, for those researchers who decide that for their data and objectives natural neighbour interpolation is the best interpolation option, I present an approach to associate the interpolation with a measure of uncertainty that is consistent with all the useful properties of natural neighbour interpolation. MATERIALS & METHODS Discrete natural neighbour interpolation In the 2-dimensional planar context that is most relevant to geographical applications, discrete natural neighbour interpolation begins by calculating a discrete Voronoi diagram. First, a raster spatial domain C of cells c is defined such that c ∈C ⊂R2 and hence each c has coordinate attributes x,y for its centre so all ci={xi,yi}. Etherington (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.282 2/16 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.282 The data points are then used to define a set P of n data cells P ={p1,p2,p3,...,pn} where P ∈C, and each data cell has coordinate attributes for its cell centre x,y and value z, so pi={xi,yi,zi}. When multiple data points occur within a raster cell, the resulting data cell has a value z that is the mean of all the data point values. The discrete Voronoi polygon V (pi) that contains all the cells that are closest to each data cell can then be defined as V (pi)={c ∈C|d(c→pi)20 and h∼>1.0 (Fig. 4B) such criteria cannot be easily applied by an analyst as while n is known h is unknown and in many situations will be hard to guess. Fortunately, while the cross-validation MAE that can always be calculated by an analyst is generally slightly higher than the e(ẑ) there is still a strong correlation between the two variables (Fig. 6D), and this correlation is extremely useful as it indicates to an analyst the likely levels of e(ẑ) and therefore e(ê) too. A comparison of e(ẑ) and e(ê) inside and outside of the convex hull around the sampling points clearly shows that while the performance follows a similar trend e(ẑ) and e(ê) can be expected to be higher outside of the convex hull (Figs. 6E–6F). Etherington (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.282 10/16 https://peerj.com https://doi.org/10.7717/peerjcs.282/fig-5 http://dx.doi.org/10.7717/peerj-cs.282 Figure 6 Performance of natural neighbour interpolation and cross-validation error-distance fields from 500 virtual geography experiments. The mean absolute error (MAE) of cells within the convex hull around sampling points for different experimental combinations of the number n of random sampling points and the spatial-autocorrelation h of virtual phenomena fields for (A) the value errors e(ẑ) from the natural neighbour interpolations and (B) the error of errors e(ê) from the cross-validation error-distance fields that (C) were highly correlated. (D) Comparison of e(ẑ) and the cross-validation MAE derived from the sampling points. Comparison of interpolation performance inside and outside the convex hull around the sampling points for (E) e(ẑ) and (F) e(ê). Full-size DOI: 10.7717/peerjcs.282/fig-6 Etherington (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.282 11/16 https://peerj.com https://doi.org/10.7717/peerjcs.282/fig-6 http://dx.doi.org/10.7717/peerj-cs.282 DISCUSSION The virtual geography experiments indicate that under suitable conditions the natural neighbour interpolation field and the cross-validation error-distance field should provide useful estimates of a geographic phenomenon field with associated uncertainty. The fact that the cross-validation error-distance field reflects localised changes in the spatial distribution of both the underlying phenomenon and the point data is particularly useful, and contrasts with other spatial interpolation uncertainty methods such as MAE and RMSE that estimate error using a global approach. The virtual geography experiments demonstrated that the performance of natural neighbour interpolation will be lower outside of the convex hull around the data points, as is expected (Watson, 1999)—although this is also likely to be true of all spatial interpolation techniques as beyond the convex hull interpolation becomes extrapolation. However, we do not suggest that interpolation should be restricted to within the convex hull as there may be occasions where the area of interest may occur slightly outside the convex hull. For example, when interpolating rainfall data from weather stations that are usually sited in settlements, there are likely to be areas of coastline along peninsulas and headlands that will not fall within a convex hull around the weather stations (Lyra et al., 2018). Therefore, it is logistically useful that discrete natural neighbour interpolation can produce estimated values beyond the convex hull of the available data points. What is helpful in this context is that the cross-validation error-distance field incorporates information on distance from data points, therefore as interpolations move further beyond the convex hull the error-field should increase to help to guard against erroneous estimates. However, the responsibility of appropriate use of natural neighbour interpolation still belongs with the spatial analyst who must make decisions about whether interpolation is useful based on their knowledge of: the expected spatial-autocorrelation of the phenomenon being interpolated, the number and distribution of data points, the location of the areas for which interpolations are required, and the magnitude of the estimated errors in relation to the magnitude of the value estimates. And of course, the cross-validation error-distance field only captures uncertainty in the interpolation itself and does not incorporate any uncertainty that may arise from the data itself. While I have argued against the use of the cross-validation MAE as a measure of uncertainty, I would recommend that analysts continue to calculate the cross-validation MAE given its strong correlation with the performance of the natural neighbour interpolation, and therefore the performance of the cross-validation error-distance field too. Analysts can then use the cross-validation MAE as a helpful guide when deciding if interpolation is advisable or not. When doing so it is important to remember that as the cross-validation MAE is based on the use of n−1 data cells, the error estimates may be slightly higher than the real errors that would be based on all n data that is ultimately used in the interpolation (Willmott & Matsuura, 2006). Therefore, the cross-validation MAE should be seen as a slightly conservative indication of likely interpolation performance. Etherington (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.282 12/16 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.282 CONCLUSION For those researchers for whom natural neighbour interpolation is the best interpolation option, the cross-validation error-distance field method presented provides a way to assess the uncertainty associated with natural neighbour interpolations that is consistent with the useful properties of natural neighbour interpolation. While the cross-validation error-distance method has been described here in the context of discrete natural neighbour interpolation, there is no reason why this same approach could not be applied to geometric natural neighbour interpolation as well. Discrete natural neighbour interpolation has been implemented here in two-dimensional space for ease of visualisation, but the method will generalise to higher dimensions (Park et al., 2006) and in principle I cannot see any reason why the uncertainty method presented could not also be applied in higher dimensions by those who wish to do so. The approach could easily be adapted to other interpolation methods, as all that is required is a measure of weighted distances to the data points creating the interpolation. Given the promise of the algorithm, and to encourage its use and development, the Python code used to generate the examples presented is freely available under the permissive MIT License as supplementary material. ADDITIONAL INFORMATION AND DECLARATIONS Funding This work was supported by the Strategic Science Investment Funding for Crown Research Institutes from the New Zealand Ministry of Business, Innovation and Employment’s Science and Innovation Group. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Grant Disclosures The following grant information was disclosed by the author: Strategic Science Investment Funding for Crown Research Institutes from the New Zealand Ministry of Business, Innovation and Employment’s Science and Innovation Group. Competing Interests Thomas R. Etherington is employed by Manaaki Whenua-Landcare Research, and declares that there are no competing interests. Author Contributions • Thomas R. Etherington conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft. Data Availability The following information was supplied regarding data availability: Python scripts to reproduce the examples, virtual experiments, and figures are available as a Supplementary File. Etherington (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.282 13/16 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.282#supplemental-information http://dx.doi.org/10.7717/peerj-cs.282 Supplemental Information Supplemental information for this article can be found online at http://dx.doi.org/10.7717/ peerj-cs.282#supplemental-information. REFERENCES Abramov O, McEwan A. 2004. An evaluation of interpolation methods for Mars Orbiter Laser Altimeter (MOLA) data. International Journal of Remote Sensing 25(3):669–676 DOI 10.1080/01431160310001599006. Bater CW, Coops NC. 2009. Evaluating error associated with lidar-derived DEM interpo- lation. Computers & Geosciences 35(2):289–300 DOI 10.1016/j.cageo.2008.09.001. Burrough PA, McDonnell RA. 1998. Principles of geographical information systems. Oxford: Oxford University Press. Etherington TR, Holland EP, O’Sullivan D. 2015. NLMpy: a Python software package for the creation of neutral landscape models within a general numerical framework. Methods in Ecology and Evolution 6(2):164–168 DOI 10.1111/2041-210X.12308. Fournier A, Fussell D, Carpenter L. 1982. Computer rendering of stochastic models. Communications of the ACM 25(6):371–384 DOI 10.1145/358523.358553. Ghosh S, Gelfrand AE, Mlhave T. 2012. Attaching uncertainty to deterministic spatial in- terpolations. Statistical Methodology 9(1–2):251–264 DOI 10.1016/j.stamet.2011.06.001. Hofstra N, Haylock M, New M, Jones P, Frei C. 2008. Comparison of six methods for the interpolation of daily, European climate data. Journal of Geophysical Research 113(D21):D21110 DOI 10.1029/2008JD010100. Hunter JD. 2007. Matplotlib: a 2D graphics environment. Computing in Science & Engineering 9(3):90–95 DOI 10.1109/MCSE.2007.55. Keller VDJ, Tanguy M, Prosdocimi I, Terry JA, Hitt O, Cole SJ, Fry M, Morris DG, Dixon H. 2015. CEH-GEAR: 1 km resolution daily and monthly areal rainfall estimates for the UK for hydrological and other applications. Earth System Science Data 7(1):143–155 DOI 10.5194/essd-7-143-2015. Lyra GB, Correia TP, De Oliveira-Jnior JF, Zeri M. 2018. Evaluation of methods of spatial interpolation for monthly rainfall data over the state of Rio de Janeiro, Brazil. Theoretical and Applied Climatology 134(3):955–965 DOI 10.1007/s00704-017-2322-3. Okabe A, Boots B, Sugihara K, Chiu SN. 2000. Spatial Tessellations: concepts and applications of Voronoi diagrams. 2nd edition. Chichester: John Wiley & Sons. O’Sullivan D, Unwin DJ. 2010. Geographic information analysis. 2nd edition. Hoboken: John Wiley & Sons. Park SW, Linsen L, Kreylos O, Owens JD, Hamann B. 2006. Discrete Sibson interpo- lation. IEEE Transactions on Visualization and Computer Graphics 12(2):243–253 DOI 10.1109/TVCG.2006.27. Pérez F, Granger BE, Hunter JD. 2011. Python: an ecosystem for scientific computing. Computing in Science & Engineering 13(2):13–21 DOI 10.1109/MCSE.2010.119. Etherington (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.282 14/16 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.282#supplemental-information http://dx.doi.org/10.7717/peerj-cs.282#supplemental-information http://dx.doi.org/10.1080/01431160310001599006 http://dx.doi.org/10.1016/j.cageo.2008.09.001 http://dx.doi.org/10.1111/2041-210X.12308 http://dx.doi.org/10.1145/358523.358553 http://dx.doi.org/10.1016/j.stamet.2011.06.001 http://dx.doi.org/10.1029/2008JD010100 http://dx.doi.org/10.1109/MCSE.2007.55 http://dx.doi.org/10.5194/essd-7-143-2015 http://dx.doi.org/10.1007/s00704-017-2322-3 http://dx.doi.org/10.1109/TVCG.2006.27 http://dx.doi.org/10.1109/MCSE.2010.119 http://dx.doi.org/10.7717/peerj-cs.282 Sambridge M, Braun J, McQueen H. 1995. Geophysical parametrization and interpo- lation of irregular data using natural neighbours. Geophysical Journal International 122(3):837–857 DOI 10.1111/j.1365-246X.1995.tb06841.x. Sibson R. 1981. A brief description of natural neighbour interpolation. In: Barnett V, ed. Interpreting multivariate data. Chichester: Wiley, 21–36. Slocum TA, McMaster RB, Kessler FC, Howard HH. 2014. Thematic cartography and geovisualization. Harlow: Pearson Education Limited. Tobler W. 1970. A computer movie simulating urban growth in the Detroit region. Economic Geography 46(2):234–240 DOI 10.2307/143141. Tomlin CD. 1990. Geographic information systems and cartographic modeling. Englewood Cliffs: Prentice-Hall. Van der Walt S, Colbert SC, Varoquaux G. 2011. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering 13(2):22–30 DOI 10.1109/MCSE.2011.37. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, van der Walt SJ, Brett M, Wilson J, Millman KJ, Mayorov N, Nelson ARJ, Jones E, Kern R, Larson E, Carey CJ, Polat I, Feng Y, Moore EW, VanderPlas J, Laxalde D, Perktold J, Cimrman R, Henriksen I, Quintero EA, Harris CR, Archibald AM, Ribeiro AH, Pedregosa F, van Mulbregt P, Vijaykumar A, Bardelli AP, Rothberg A, Hilboll A, Kloeckner A, Scopatz A, Lee A, Rokem A, Woods CN, Fulton C, Masson C, Häggström C, Fitzgerald C, Nicholson DA, Hagen DR, Pasechnik DV, Olivetti E, Martin E, Wieser E, Silva F, Lenders F, Wilhelm F, Young G, Price GA, Ingold G-L, Allen GE, Lee GR, Audren H, Probst I, Dietrich JP, Silterra J, Webber JT, Slavi J, Nothman J, Buchner J, Kulick J, Schnberger JL, de Miranda Cardoso JV, Reimer J, Harrington J, Rodrguez JLC, Nunez-Iglesias J, Kuczynski J, Tritz K, Thoma M, Newville M, Kmmerer M, Bolingbroke M, Tartre M, Pak M, Smith NJ, Nowaczyk N, Shebanov N, Pavlyk O, Brodtkorb PA, Lee P, McGibbon RT, Feldbauer R, Lewis S, Tygier S, Sievert S, Vigna S, Peterson S, More S, Pudlik T, Oshima T, Pingel TJ, Robitaille TP, Spura T, Jones TR, Cera T, Leslie T, Zito T, Krauss T, Upadhyay U, Halchenko YO, Vázquez-Baeza Y. 2020. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods 17(3):261–272 DOI 10.1038/s41592-019-0686-2. Watson D. 1999. The natural neighbor series manuals and source codes. Computers & Geosciences 25(4):463–466 DOI 10.1016/S0098-3004(98)00150-2. Willmott CJ, Matsuura K. 2005. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate Research 30(1):79–82 DOI 10.3354/cr030079. Willmott CJ, Matsuura K. 2006. On the use of dimensioned measures of error to eval- uate the performance of spatial interpolators. International Journal of Geographical Information Science 20(1):89–102 DOI 10.1080/13658810500286976. Yilmaz HM. 2007. The effect of interpolation methods in surface definition: an experimental study. Earth Surface Processes and Landforms 32(9):1346–1361 DOI 10.1002/esp.1473. Etherington (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.282 15/16 https://peerj.com http://dx.doi.org/10.1111/j.1365-246X.1995.tb06841.x http://dx.doi.org/10.2307/143141 http://dx.doi.org/10.1109/MCSE.2011.37 http://dx.doi.org/10.1038/s41592-019-0686-2 http://dx.doi.org/10.1016/S0098-3004(98)00150-2 http://dx.doi.org/10.3354/cr030079 http://dx.doi.org/10.1080/13658810500286976 http://dx.doi.org/10.1002/esp.1473 http://dx.doi.org/10.7717/peerj-cs.282 Zhang J, Goodchild M. 2002. Uncertainty in Geographical Information. London: Taylor & Francis DOI 10.1201/b12624. Etherington (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.282 16/16 https://peerj.com http://dx.doi.org/10.1201/b12624 http://dx.doi.org/10.7717/peerj-cs.282