Title: Geometry- and Accuracy-Preserving Random Forest Proximities
Authors: Jake S. Rhodes, Adele Cutler, Kevin R. Moon
Date: 2022-01-29

Abstract: Random forests are considered one of the best out-of-the-box classification and regression algorithms due to their high level of predictive performance with relatively little tuning. Pairwise proximities can be computed from a trained random forest which measure the similarity between data points relative to the supervised task. Random forest proximities have been used in many applications including the identification of variable importance, data imputation, outlier detection, and data visualization. However, existing definitions of random forest proximities do not accurately reflect the data geometry learned by the random forest. In this paper, we introduce a novel definition of random forest proximities called Random Forest-Geometry- and Accuracy-Preserving proximities (RF-GAP). We prove that the proximity-weighted sum (regression) or majority vote (classification) using RF-GAP exactly matches the out-of-bag random forest prediction, thus capturing the data geometry learned by the random forest. We empirically show that this improved geometric representation outperforms traditional random forest proximities in tasks such as data imputation and provides outlier detection and visualization results consistent with the learned data geometry.

Random forests [1] are well-known, powerful predictors comprised of an ensemble of binary recursive decision trees.
Random forests are easily adapted for both classification and regression, are trivially parallelizable, can handle mixed variable types (continuous and categorical), are unaffected by monotonic transformations, are insensitive to outliers, scale to small and large datasets, handle missing values, are capable of modeling non-linear interactions, and are robust to noise variables [1], [2]. Random forests are simple to use, produce good results with little to no tuning, and can be applied to a wide variety of fields. Citing some recent examples, Benali et al. [3] demonstrated superior solar radiation prediction using random forests over neural networks. The work of [4] showed that random forests produced the best results with the lowest standard errors in classifying error types in rotating machinery when compared with models more commonly used in this application, such as the SVM and neural networks. Other recent successes include patient health prediction after exposure to COVID-19 [5], spatio-temporal COVID-19 case estimation [6], landslide susceptibility mapping [7], infectious diarrhea forecasting [8], cardiovascular disease prediction [9], deforestation rate prediction [10], nanofluid viscosity estimation [11], rural credit assessment [12], RNA pseudouridine site prediction [13], wearable-sensor activity classification [14], and heavy metal distribution estimation in agricultural soils [15]. In addition to their high predictive power, random forests have a natural extension to produce pair-wise proximity (similarity) measures determined by the partitioning space of the decision trees which comprise them. The random forest proximity measure between two observations was first defined by Leo Breiman as the proportion of trees in which the observations reside in the same terminal node [16]. As splitting variable values are determined to optimize partitions according to the supervised task, these proximities encode a supervised similarity measure.
Many machine learning methods depend on some measure of pairwise similarity (which is usually unsupervised), including dimensionality reduction methods [17], [18], [19], [20], [21], [22], [23], spectral clustering [24], and any method involving the kernel trick, such as SVM [25] and kernel PCA [26]. Random forest proximities can be used to extend many of these problems to a supervised setting and have been used for data visualization [27], [28], [29], [30], [31], outlier detection [30], [32], [33], [34], and data imputation [35], [36], [37], [38]. See Section 2 for further discussion of random forest proximity applications. Unlike unsupervised similarity measures, random forest proximities incorporate variable importance relative to the supervised task, as important variables are more likely to be used in determining splits in the decision trees [2]. Ideally, random forest proximities should define a data geometry that is consistent with the learned random forest; that is, the random forest predictive ability should be recoverable from the proximities. In this case, applications involving random forest proximities, such as data visualization, can lead to improved interpretability of the random forest specifically, and more generally of the data geometry relative to the supervised task. One way to test for consistency is to compute a proximity-weighted predictor, where a data point's predicted label consists of a proximity-weighted sum of the labels of all other points. This predictor should match the random forest prediction if the proximities are consistent with the random forest. However, under Breiman's original definition, the proximity-weighted predictions do not match those of the random forest, even when applied to the training data (see Section 5). Thus, this definition does not capture the data geometry learned by the random forest, limiting its potential for improved interpretability of the random forest.
We define a new random forest proximity measure called Random Forest-Geometry- and Accuracy-Preserving proximities (RF-GAP) that defines a data geometry such that the proximity-weighted predictions exactly match those of the random forest for both regression and classification. Under our definition, an out-of-bag observation's proximities are computed via in-bag (training) observations. That is, the sample used to generate a decision tree also generates the proximities of out-of-bag observations (observations not used to construct the tree). The proximity of an out-of-bag observation to another observation is based on the reciprocal of the number of in-bag observations in their shared terminal nodes, averaged over trees. We prove the equivalence of the proximity-weighted predictions with those of the random forest and demonstrate this empirically. We then compare RF-GAP proximities with existing random forest proximities in common applications: data imputation, visualization, and outlier detection. In all cases, RF-GAP outperforms existing definitions, showing that it better captures the geometry learned by the random forest. Random forest proximities have been used in many applications. One common usage is visualization or clustering. Unsupervised random-forest clustering was introduced by Shi and Horvath in [39]. In that paper, they applied both classical and metric multi-dimensional scaling (MDS) to a number of datasets. Pouyan et al. used unsupervised random forest proximity matrices to generate 2D plots for visualization using t-SNE [27]. This approach was used to visualize two mass cytometry datasets and provided clearer visual results over other compared distance metrics (Euclidean, Chebyshev, and Canberra). Similarly, [29] used an unsupervised random forest for clustering tumor profilings and showed improvement over other common clustering algorithms with microarray data.
Random forest proximities have also been used in the supervised setting for visualizing the data, typically via MDS [30], [40], but other approaches have been used [31]. However, these approaches use a proximity definition which does not match the data geometry learned by the random forest. Thus, the visualizations do not give a true representation of this geometry. Random forest proximities have also been used for outlier detection in a supervised setting. An observation's outlier score is typically defined to be inversely proportional to its average within-class proximity measure (see Section 6.3 for details). This approach was used to achieve better random forest error rates in both classification and regression in pathway analysis [30]. Nesa et al. demonstrated the superiority of random forests for detecting errors and events across internet of things (IoT) device sensors over other multivariate outlier detection methods [41]. The random forest outlier detection algorithm has been effective in other contexts such as modeling species distribution [42], detecting food adulteration via infrared spectroscopy [43], predicting galaxy spectral measurements [44], and detecting network anomalies [45]. Random forest proximities are used to impute missing data by replacing missing values of a given variable with a proximity-weighted sum of non-missing values. Pantanowitz and Marwala used this approach to impute missing data in the context of HIV seroprevalence [37]. They compared their results with five additional imputation methods, including neural networks and random forest-neural network hybrids, and concluded that random forest imputation produced the most accurate results with the lowest standard errors. Shah et al. compared random forest imputation with multivariate imputation by chained equations (MICE [46]) on cardiovascular health records data [38].
They showed that random forest imputation methods typically produced more accurate results and that in some circumstances MICE gave biased results under its default parameterization. In [35] it was similarly shown that random forests produced the most accurate imputation in a comprehensive metabolomics imputation study. Variable importance assessment is another application of random forest proximities. One approach is to use differences in proximity measures as a criterion for assessing variable permutation importance [47]. Variable selection criteria were compared across various datasets and some improvement over existing variable importance measures was achieved. In [48], feature contributions to the random forest decision space (defined by proximities) are explored. While many measures of variable importance are computed at a global level, the authors propose a bit-wise, permutation-based feature importance which captures both contribution (influence of the feature in the decision space) and closeness (position in the decision space relative to the in- and out-class), giving further insight into each feature's contribution at the terminal node level. Multi-modal/multi-view problems can also be approached using random forest proximities. Gray et al. additively combined random forest proximities from different modalities or views (FDG-PET and MR imaging) of persons with Alzheimer's disease or with mild cognitive impairment. They applied MDS to the combined proximities to create an embedding used for classification. Classification on the multi-modal embedding showed significantly better results than classification on either mode separately [28]. Cao, Bernard, Sabourin, and Heutte explored variations of proximity-based classification techniques in the context of multi-view radiomics problems [49]. They compared with the work from [28] and joined proximity matrices using linear combinations.
The authors explored random forest parameters to determine the quality of the proximity matrices. They concluded that a large number of maximum-depth trees produced the best-quality proximities, quantified using a one-nearest neighbor classifier. Using our proposed random forest proximity measure, which accurately reflects the random forest predictions from each view, may add to the success of this method, thus creating a truer forest ensemble for multi-view learning. Seoane et al. proposed the use of random forest proximities to measure gene annotations with some improvement in precision over other existing methods [50]. Zhao et al. presented two matching methods for observational studies, one propensity-based and the other random forest-proximity-based [51]. For the random-forest approach, subjects of different classes were iteratively matched based on nearest proximity values. They showed that the proximity-based matching was superior to the propensity matching in addition to other existing matching techniques. All of these applications used existing definitions of random forest proximities that do not match the data geometry learned by the random forest. In contrast, RF-GAP accurately reflects this geometry. Thus, the applications presented here should show improvement using our definition. Indeed, we show experimentally in Section 5 that using RF-GAP gives data visualizations that more accurately represent the geometry learned from the random forests, outlier scores that are more reflective of the random forest's learning, and improved random forest imputations. Here we provide several existing definitions of random forest proximities, followed by RF-GAP, which preserves the geometry learned by the random forest. Let M = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)} be the training data, where each x_i ∈ X is a d-dimensional vector of predictor variables with corresponding response y_i ∈ Y. We will use the following notation (see Fig.
1 for a visual example):

• T is the set of decision trees in a random forest with |T| = T.
• B(t) is the multiset of indices in the bootstrap sample of the training data that is randomly selected to train the tree t ∈ T. Thus B(t) contains the indices of the in-bag observations.
• O(t) is the set of indices of the training data that are not contained in B(t). O(t) is often referred to as the out-of-bag (OOB) sample.
• S_i = {t ∈ T | i ∈ O(t)}. This is the set of trees in which the observation i is OOB.
• v_i(t) contains the indices of all observations that end up in the same terminal node as x_i in tree t.
• J_i(t) = B(t) ∩ v_i(t) contains the indices in v_i(t) that correspond with the in-bag observations of t. I.e., these are the observations that are in-bag and end up in the same terminal node as x_i.

Each decision tree t in a random forest is grown by recursively partitioning (splitting) the bootstrap sample into nodes, where splits are determined across a subset of feature variables to maximize purity (classification) or minimize the mean squared residuals (regression) in the resulting nodes. This process repeats until a stopping criterion is met. For classification, splits typically continue until nodes are pure (one class). For regression, a common stopping criterion is a predetermined minimum node size (e.g., 5). The trees in random forests are typically not pruned. OOB samples are commonly used to provide an unbiased estimate of the forest's generalization error rate [1]. The strength of the random forest is highly dependent on the predictive power of the individual decision trees (base learners) and on low correlation between the decision trees [16], [52]. In addition to bootstrap sampling, further correlational decrease between trees is ensured by selecting a random subset of predictor variables at each node for split optimization. The number of variables to be considered is designated by the parameter mtry in many random forest packages [53], [54].

Fig. 1. An example of a random forest and notation with regard to a particular observation x_1. The red-encircled trees are those in which x_1 is out of bag, making up the set of trees S_1. A particular tree in S_1 is exhibited. The out-of-bag indices for the tree are given in red (i ∈ O(t)), while the in-bag indices (i ∈ B(t)) are shown in black. The indices of observations residing in the same terminal node as x_1 are given by the set v_1(t). J_1(t) gives the in-bag observation indices in the terminal node v_1(t).

The resulting terminal nodes partition the input space X. This partition is often used in defining random forest proximities, as in Breiman's original definition:

Definition 1 (Original Proximity [16]). The random forest proximity between observations i and j is given by

  p_orig(i, j) = (1/T) ∑_{t=1}^{T} I(j ∈ v_i(t)),

where T is the number of trees in the forest, v_i(t) contains the indices of observations that end up in the same terminal node as x_i in tree t, and I(·) is the indicator function. That is, the proximity between observations i and j is the proportion of trees in which they reside in the same terminal node, regardless of bootstrap status.

Definition 1 (the original definition) does not capture the data geometry learned by the random forest as it does not take an observation's bootstrap status (whether or not the observation was used in the training of any particular tree) into account in the proximity calculation: both in-bag and out-of-bag samples are used. In-bag observations of different classes will necessarily terminate in different nodes (as trees are grown until pure). Thus this produces an over-exaggerated class separation in the proximity values. Despite its weaknesses, this definition has been used for outlier detection, data imputation, and visualization. However, these applications may produce misleading results as this definition tends to overfit the training data, quantified by low error rates as proximity-weighted predictors.
One attempt to overcome this issue redefines the proximity measure between observations i and j using only trees in which both observations are out-of-bag (OOB proximities):

Definition 2 (OOB Proximity [53], [55]). The OOB proximity between observations with indices i and j is given by

  p_oob(i, j) = ∑_{t ∈ S_i} I(j ∈ v_i(t) ∩ O(t)) / ∑_{t ∈ S_i} I(j ∈ O(t)),

where O(t) denotes the set of indices of the observations that are out-of-bag in tree t, and S_i is the set of trees for which observation i is out-of-bag. In other words, this proximity measures the proportion of trees in which observations i and j reside in the same terminal node, both being out-of-bag.

Definition 2 [55] is currently used in the randomForest [53] package by Liaw and Wiener in the R programming language [56]. It may have been inadvertently used in papers which used this package, but none made explicit mention of the use of OOB observations in building proximities. However, we find that this definition also does not characterize the random forest and generally produces higher error rates as a proximity-weighted predictor (when compared to the random forest's OOB error rate; see Section 5). Several alternative random forest proximity measures beyond Definitions 1 and 2 have been proposed previously. In each case, source code has not been provided. In [57], the authors define a proximity-based kernel (PBK) which accounts for the number of branches between observations in each decision tree, defining the proximity between i and j as

  p_PBK(i, j) = (1/T) ∑_{t=1}^{T} e^{−w · g_ijt},

where T is the number of trees in the forest, w is a user-defined parameter, and g_ijt is the number of branches between observations i and j in tree t. g_ijt is defined to be 0 if the observations reside in the same terminal node. The proximity quality was quantified using the classification accuracy when applied as a kernel in a support-vector machine. This definition showed some improvement over the original definition (Definition 1) only when considering very small numbers of trees (5 or 10).
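The OOB proximity of Definition 2 can be sketched by additionally tracking which observations are out of bag in each tree. Again, this is our own illustrative scaffolding (scikit-learn trees on explicit bootstrap samples), not the randomForest package's implementation:

```python
# Sketch of the OOB proximity (Definition 2): of the trees where both i
# and j are out of bag, the fraction in which they share a terminal node.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)
N, T = len(X), 100

leaves = np.empty((T, N), dtype=int)
oob = np.zeros((T, N), dtype=bool)     # oob[t, i] = True iff i ∈ O(t)
for t in range(T):
    boot = rng.integers(0, N, size=N)                # B(t)
    oob[t] = ~np.isin(np.arange(N), boot)            # O(t)
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=t)
    tree.fit(X[boot], y[boot])
    leaves[t] = tree.apply(X)

both_oob = np.zeros((N, N))          # denominator: trees with both i, j OOB
same_and_oob = np.zeros((N, N))      # numerator: ... and in the same leaf
for t in range(T):
    pair_oob = oob[t][:, None] & oob[t][None, :]
    both_oob += pair_oob
    same_and_oob += pair_oob & (leaves[t][:, None] == leaves[t][None, :])
prox_oob = np.divide(same_and_oob, both_oob,
                     out=np.zeros((N, N)), where=both_oob > 0)
```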
However, PBK is computationally expensive as all pair-wise branch distances must be computed within each tree. This is not an issue for small numbers of trees, but the number of trees in a random forest is typically measured in hundreds or thousands (randomForest [53] has a default of 500 trees). Additionally, this method adds a user-defined, tunable parameter, which adds to its complexity. The authors in [58] describe an approach for computing random forest proximities in the context of a larger class of Random Partitioning Kernels. While most random forest proximities are determined primarily through associations within terminal nodes, this approach selects a random tree height and partitions the data based on this higher-level splitting. The authors do not compare with other proximity definitions (nor do they frame their work in the context of random forest proximities), but they compare this random forest kernel to other typical kernels (linear, RBF, etc.) using a log-likelihood test. The random forest kernel outperformed the others in most cases, and the authors visually demonstrated their kernel using 2D PCA plots. The code for this approach is not publicly available. Cao et al. introduced two random forest dissimilarity measures which are used in the context of multi-view classification [59]. The first measure (denoted RFDisNC) weights the proximity values by the proportion of correctly-classified observations within each node, accounting for both in- and out-of-bag observations. The second (RFDisIH) is based on instance hardness. Euclidean distances between observations at each terminal node are calculated (using only feature variables which were used as splitting variables leading to the terminal node, to avoid the curse of dimensionality) and used as weights as part of the dissimilarity measure.
Given this distance, they use k-disagreeing neighbors (kDN) in the formulation of the dissimilarity measure. The use of kDN gives a notion of the difficulty of classifying a particular observation. In the multi-view problem, the dissimilarities from different views are averaged before classification. The authors showed that RFDisIH performed better overall on classification tasks compared with other multi-view methods. However, RFDisIH was not compared with other random forest proximities in their commonly-used applications (e.g., visualization or imputation). A similarity measure can be constructed from RFDisIH as RFProxIH = 1 − RFDisIH. While these alternative definitions have shown promise in their respective applications, their connection to the data geometry learned by the random forest is not clear. In contrast, we present a new definition of random forest proximities that exactly characterizes the random forest performance on both in-bag and out-of-bag samples:

Definition 3 (Random Forest-Geometry- and Accuracy-Preserving Proximities (RF-GAP)). Let B(t) be the multiset of (potentially repeated) indices of bootstrap (in-bag) observations, and let c_j(t) denote the multiplicity of the index j in B(t). We define J_i(t) to be the multiset of in-bag observations which share the terminal node with observation i in tree t, or J_i(t) = B(t) ∩ v_i(t), with cardinality |J_i(t)| (counting multiplicity). Then, for given observations i and j, their proximity measure is defined as

  p_GAP(i, j) = (1/|S_i|) ∑_{t ∈ S_i} c_j(t) · I(j ∈ J_i(t)) / |J_i(t)|.

That is, considering only trees for which observation i is out-of-bag, the proximity between i and j is the average proportion of in-bag observations in the shared terminal node of i and j over all trees where i is out of bag and j is in bag. This proposed definition is, in part, inspired by the work of Lin and Jeon [60]. We show in Section 4 that the random forest OOB prediction (and thus the generalization error rate) is exactly reproduced as a weighted sum (for regression) or a weighted-majority vote (for classification) using the proximities in Definition 3 as weights.
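RF-GAP can be computed from per-tree leaf assignments and in-bag multiplicities. The following is a minimal sketch (our own scaffolding with scikit-learn trees on explicit bootstrap samples; the authors' implementation may differ). Note that each row sums to one, as shown in Section 4, and that the matrix is not symmetric:

```python
# Sketch of RF-GAP (Definition 3): row i accumulates only over trees
# where i is out of bag; j contributes c_j(t) / |J_i(t)| when in bag
# and in the shared terminal node.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)
N, T = len(X), 100

prox = np.zeros((N, N))   # p_GAP(i, j)
s_count = np.zeros(N)     # |S_i|: number of trees in which i is OOB
for t in range(T):
    boot = rng.integers(0, N, size=N)               # multiset B(t)
    counts = np.bincount(boot, minlength=N)         # c_j(t): in-bag multiplicity
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=t)
    tree.fit(X[boot], y[boot])
    leaf = tree.apply(X)                            # v_i(t) via shared leaf ids
    for i in np.flatnonzero(counts == 0):           # observations with i ∈ O(t)
        in_leaf = (leaf == leaf[i]) & (counts > 0)  # J_i(t) as a boolean mask
        size = counts[in_leaf].sum()                # |J_i(t)|, counting multiplicity
        prox[i, in_leaf] += counts[in_leaf] / size
        s_count[i] += 1
mask = s_count > 0
prox[mask] /= s_count[mask][:, None]                # average over S_i
```

Since counts[i] = 0 whenever i is out of bag, p_GAP(i, i) = 0 falls out automatically, matching the convention in Section 4.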
Thus, this definition characterizes the random forest's predictions, keeping the learned data geometry intact. Subsequently, applications using this proximity definition will provide results which are truer to the random forest from which the proximities are derived. Here we show that the random forest prediction is exactly reproduced as a weighted sum (for regression) or a weighted-majority vote (for classification) using RF-GAP as weights. We first show that for a given observation, the proximities are non-negative and sum to one. In contrast, the proximities in Definitions 1 and 2 must be row-normalized to sum to one. Note that we require that p_GAP(i, i) = 0 for the proximities to sum to one, although the exact value of p_GAP(i, i) does not matter in practice as it is not considered in the proximity-weighted prediction.

Proposition 1. For any observation i, p_GAP(i, j) ≥ 0 for all j, and ∑_{j ≠ i} p_GAP(i, j) = 1.

Proof: It is clear from the definition that p_GAP(i, j) ≥ 0 for all i, j. The sum-to-one property follows directly from the definition:

  ∑_{j ≠ i} p_GAP(i, j) = (1/|S_i|) ∑_{t ∈ S_i} ∑_{j ≠ i} c_j(t) · I(j ∈ J_i(t)) / |J_i(t)| = (1/|S_i|) ∑_{t ∈ S_i} |J_i(t)| / |J_i(t)| = 1.

This proposition allows us to directly use RF-GAP as weights for classification or regression. We show that the proximity-weighted prediction under Definition 3 matches the random forest OOB prediction, giving the same OOB error rate in both the classification and regression settings. Since the OOB error rate is typically used to estimate the forest's generalization error and quantify its goodness of fit, this indicates that RF-GAP accurately represents the geometry learned by the random forest.

Theorem 1 (Proximity-Weighted Regression). For a given training data set S = {(x_1, y_1), …, (x_N, y_N)}, with y_i ∈ R, the random forest OOB regression prediction is exactly determined by the proximity-weighted sum using RF-GAP (Definition 3).

Proof: For a given tree t and i ∈ O(t), the decision tree predictor of y_i is the mean response of the in-bag observations in the appropriate terminal node. That is,

  ŷ_i(t) = (1/|J_i(t)|) ∑_{j ∈ J_i(t)} c_j(t) y_j.

The random forest prediction, ŷ_i, is the mean response over all trees for which i is out of bag.
That is,

  ŷ_i = (1/|S_i|) ∑_{t ∈ S_i} ŷ_i(t).

The proximity-weighted predictor, ŷ_i^p, is simply the weighted sum of responses {y_j}_{j ≠ i}:

  ŷ_i^p = ∑_{j ≠ i} p_GAP(i, j) y_j = (1/|S_i|) ∑_{t ∈ S_i} ∑_{j ∈ J_i(t)} (c_j(t)/|J_i(t)|) y_j = (1/|S_i|) ∑_{t ∈ S_i} ŷ_i(t) = ŷ_i.

Theorem 2 (Proximity-Weighted Classification). For a given training data set S = {(x_1, y_1), …, (x_N, y_N)}, with y_i ∈ {1, …, K} for all i ∈ {1, …, N}, the random forest OOB classification prediction is exactly determined by the weighted-majority vote using RF-GAP (Definition 3) as weights.

Proof: The random forest classification prediction, ŷ_i, is the most popular predicted class over all t ∈ S_i:

  ŷ_i = argmax_k ∑_{t ∈ S_i} I(ŷ_i(t) = k).

We show equivalence with the proximity-weighted predictor, which predicts the class with the largest proximity-weighted vote:

  ŷ_i^p = argmax_k ∑_{j: y_j = k} p_GAP(i, j) = argmax_k (1/|S_i|) ∑_{t ∈ S_i} ∑_{j ∈ J_i(t): y_j = k} c_j(t)/|J_i(t)| = argmax_k ∑_{t ∈ S_i} ∑_{j ∈ J_i(t): y_j = k} c_j(t)/|J_i(t)|.

The last equality holds as |S_i| does not depend on k. As classification trees in a random forest are grown until terminal (leaf) nodes are pure, all in-bag observations in a given terminal node belong to the same class. Denote the common class of the observations j ∈ J_i(t) as y_{i,t}. Then the single-tree predictor is given by

  ŷ_i(t) = y_{i,t}.

The proximity-weighted predictor is thus

  ŷ_i^p = argmax_k ∑_{t ∈ S_i} I(y_{i,t} = k) ∑_{j ∈ J_i(t)} c_j(t)/|J_i(t)| = argmax_k ∑_{t ∈ S_i} I(ŷ_i(t) = k) = ŷ_i.

To demonstrate that the RF-GAP proximities preserve the random-forest-learned data geometry, we empirically validate Theorems 1 and 2 from Section 4, demonstrating that the random forest predictions are preserved in the proximity construction. We also compare to the proximity-weighted predictor using the original random forest proximity definition (Definition 1), the OOB adaptation (Definition 2), PBK [57], and RFProxIH [59]. We compared proximity-prediction results on 19 datasets from the UCI repository [61]. Each dataset was randomly partitioned into training (80%) and test (20%) sets. For each dataset, the same trained random forest was used to produce all compared proximities. Table 1 gives the absolute difference between the proximity-weighted training errors and the random forest OOB error rate. The proximity-weighted predictor using RF-GAP almost exactly matches the random forest OOB error rates; discrepancies are due to random tie-breaking.
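Theorem 1 can also be checked numerically on a small hand-built regression forest: the RF-GAP-weighted sum of responses reproduces the forest's OOB prediction up to floating-point error. A sketch under our own scaffolding (scikit-learn trees on explicit bootstrap samples; dataset and hyperparameters are arbitrary):

```python
# Numerical check of Theorem 1: per-tree leaf means (sklearn's regression
# tree predictions) equal the RF-GAP-weighted response sums, so the
# averaged OOB prediction and the proximity-weighted sum coincide.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X, y = load_diabetes(return_X_y=True)
X, y = X[:120], y[:120]       # small subset keeps the check fast
N, T = len(X), 60

prox = np.zeros((N, N))       # RF-GAP proximities (Definition 3)
s_count = np.zeros(N)         # |S_i|
oob_sum = np.zeros(N)         # running sum of per-tree OOB predictions
for t in range(T):
    boot = rng.integers(0, N, size=N)
    counts = np.bincount(boot, minlength=N)         # c_j(t)
    tree = DecisionTreeRegressor(max_features=0.7, min_samples_leaf=5,
                                 random_state=t)
    tree.fit(X[boot], y[boot])
    leaf = tree.apply(X)
    pred = tree.predict(X)                          # leaf means of in-bag responses
    for i in np.flatnonzero(counts == 0):           # i is out of bag in this tree
        in_leaf = (leaf == leaf[i]) & (counts > 0)
        size = counts[in_leaf].sum()
        prox[i, in_leaf] += counts[in_leaf] / size
        s_count[i] += 1
        oob_sum[i] += pred[i]
mask = s_count > 0
prox[mask] /= s_count[mask][:, None]

rf_oob = oob_sum[mask] / s_count[mask]    # forest OOB prediction (mean over S_i)
gap_pred = prox[mask] @ y                 # proximity-weighted sum (Theorem 1)
```

The two vectors `rf_oob` and `gap_pred` agree to numerical precision, mirroring the exactness claim of Theorem 1.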
In contrast, the original definition, PBK, and RFProxIH typically produce much lower training error rates, suggesting overfitting. The OOB proximities (Definition 2) produce training error rates which are sometimes lower and sometimes higher than the random forest OOB error rate. Table 1 also provides the difference between the test error rates for the same datasets. Here, we still see that the proximity predictions using RF-GAP nearly always match those of the random forest, while this is not the case for the other proximity constructions. The RF-GAP proximities also generally produce the lowest test errors. This can be seen in Fig. 2, which plots the training versus test error rates using the different proximity measures. Table 2 gives the regression slope for each proximity definition. From here it is clear that the original proximities, PBK, and RFProxIH overfit the training data on average. This is corroborated in Fig. 3, which plots the difference between the random forest out-of-bag error rates and the proximity-weighted errors across the same datasets, demonstrating that the RF-GAP predictions nearly perfectly match those of the random forest for both training and test sets. It is clear that the original definition, PBK, and RFProxIH overfit the training data in contrast. In this section we demonstrate that using RF-GAP in multiple applications leads to improved performance relative to the other random forest proximity measures. We perform comparisons with applying MDS to the proximities for visualization in Section 6.1, data imputation in Section 6.2, and outlier detection in Section 6.3. Random forest proximities have already been established as successful in these applications (see Section 2). Thus, we focus our comparisons in these applications on existing random forest proximity definitions to show that the improved representation of the random forest geometry in RF-GAP leads to improved performance.
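Several of the comparisons that follow embed points by applying MDS to the proximity-derived distances √(1 − prox). A minimal classical (Torgerson) MDS sketch in NumPy, shown on a toy proximity matrix (the experiments in this paper use metric MDS; classical MDS is shown here only for brevity, and all names are our own):

```python
# Classical MDS from a proximity matrix: treat d^2 = 1 - prox as squared
# distances, double-center to get a Gram matrix, and embed via its top
# eigenpairs.
import numpy as np

def classical_mds(prox, dim=2):
    d2 = 1.0 - prox                       # squared distances, since d = sqrt(1 - prox)
    n = len(d2)
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ d2 @ J                 # Gram matrix of the embedding
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:dim]  # keep the top-dim eigenpairs
    vals = np.clip(vals[order], 0, None)  # guard against small negative eigenvalues
    return vecs[:, order] * np.sqrt(vals)

# toy proximity matrix: two tight groups that are dissimilar to each other
prox = np.full((6, 6), 0.1)
prox[:3, :3] = 0.9
prox[3:, 3:] = 0.9
np.fill_diagonal(prox, 1.0)
emb = classical_mds(prox)
```

In the embedding, within-group distances come out smaller than between-group distances, which is the qualitative behavior the visualizations below rely on.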
Additional results beyond those presented here are given in Appendix A. Leo Breiman [1], [16] first used multi-dimensional scaling (MDS) on random forest proximities to visualize the data. Since then, MDS with random forest proximities has been used in many applications including tumor profiling [29], visualizing biomarkers [40], pathway analysis [30], multi-view learning [28], [49], and unsupervised learning [39].

Table 1. Comparison of proximity-weighted predictions to the random forest errors. The random forest OOB error (on the training set) and test errors are given in the columns under RF. The absolute differences of the training and test errors from, respectively, the OOB error and RF test error are given for each proximity-weighted prediction. The results nearest the random forest predictions are in bold. RF-GAP nearly perfectly matches the random forest results, with differences being accounted for by randomly broken ties in the forest. The other definitions tend to overfit the training data, as can be seen in the large differences between the OOB and test error rates. This is corroborated by Fig. 3.

Here, we compare visualizations using the five different proximities. In each case, metric MDS was applied to √(1 − prox) to produce two-dimensional visualizations. Figure 4 gives an example of MDS applied to the classic Sonar dataset from UCI [61] with an OOB error rate of 15.85%. RF-GAP proximities (Fig. 4(a)) show two class groupings with misclassified observations between the groups or within the opposing class. The original proximities, PBK, and RFProxIH (Fig. 4(b, d, and e), respectively) show a fairly clear separation between the two classes. With these proximities, the classes appear nearly linearly separable, which accurately reflects neither the data nor the geometry learned by the random forest, given an error rate of nearly 16%. Definition 2 (Fig.
4(c)) has a similar effect as the RF-GAP definition, but with a less clear boundary and seemingly misplaced observations that are deep within the wrong class. These results suggest that RF-GAP can lead to improved supervised visualization and dimensionality reduction techniques. See Appendix A for further experiments.

Fig. 3. These boxplots show the absolute difference of the proximity-weighted prediction training and test errors from the random forest OOB error rate and test error, respectively, across five proximity definitions. RF-GAP proximity predictions most nearly match the random forest predictions for both the training (left) and test (right) data, thus best preserving the geometry learned by the random forest. Various UCI datasets were randomly split into training (80%) and test (20%) sets for this comparison.

In [35], experimental results showed that, out of nine compared imputation methods across seven different missing mechanisms (missing at random, missing completely at random, etc.), random forest-based imputation was generally the most accurate. This has been corroborated by [36], [37], [38]. Here we describe the algorithm for random forest imputation and compare results using various proximity definitions. To impute missing values of variable j:

(1) Initialize the missing values of variable j with the median (continuous) or majority value (categorical) of the non-missing values.
(2) Train a random forest on the completed data.
(3) Compute the random forest proximities.
(4) If variable j is continuous, replace the originally missing values with the proximity-weighted average of the non-missing values. If categorical, replace the missing values with the proximity-weighted majority vote.
(5) Repeat steps 2-4 as required. In many cases, a single iteration is sufficient.

We show empirically that random forest imputation is generally improved using the RF-GAP proximity definition. For our experiment, we selected various datasets from the UCI repository [61] and, for each dataset, removed 5%, 10%, 25%, 50%, and 75% of values at random (that is, the values are missing completely at random, or MCAR) using the missMethods R package.
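The proximity-weighted update at the heart of the imputation loop described above can be sketched as a single pass over each column, given a completed data matrix, a missingness mask, and any proximity matrix. This is a minimal NumPy sketch for continuous variables only; the function name and fallback behavior are our own choices:

```python
# One pass of the proximity-weighted imputation update: each originally
# missing entry is replaced by the proximity-weighted average of the
# observed values in its column.
import numpy as np

def impute_step(X_filled, missing, prox):
    X_new = X_filled.copy()
    for j in range(X_filled.shape[1]):
        obs = ~missing[:, j]                    # rows with an observed value
        if not obs.any():
            continue
        for i in np.flatnonzero(missing[:, j]):
            w = prox[i, obs]
            if w.sum() > 0:                     # otherwise keep the current fill
                X_new[i, j] = w @ X_filled[obs, j] / w.sum()
    return X_new

# toy example: two groups with distinct column scales; the proximity
# matrix links each point only to its own group
X = np.array([[1.0, 10.0], [1.2, 11.0], [np.nan, 10.5],
              [5.0, 50.0], [5.2, np.nan], [4.8, 49.0]])
missing = np.isnan(X)
X_filled = np.where(missing, np.nanmedian(X, axis=0), X)   # step (1): median start
prox = np.zeros((6, 6))
prox[:3, :3] = prox[3:, 3:] = 1 / 3
imputed = impute_step(X_filled, missing, prox)
```

Here the missing entry in the first group is filled from its group's observed values (1.0 and 1.2 give 1.1), illustrating why a proximity that respects the learned geometry drives the imputation toward the right neighborhood.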
Two comparisons were made: 1) we computed the mean MSE across 100 repetitions using a single iteration, and 2) we computed the mean MSE (across 10 repetitions) at each of 15 iterations. A summary of performance rankings is given in Table 3. Across all compared proximity definitions (RF-GAP, OOB, Original, and RFProxIH), RF-GAP achieved the best imputation scores at all percentages of missingness below 75%, and outperformed the others on 69% of the datasets when the percentage of missing values was 75%. For full results, see Table A.1 in the supplementary materials, which gives the mean MSE across 100 repetitions using a single iteration of the above-described algorithm. The number of observations and variables for each dataset are provided in the table. Supplementary figures (Figs. A.7, A.8, and A.9) compare imputation results across 16 datasets, using 15 iterations in each experiment with the mean MSE and standard errors recorded for each repetition. The value recorded at iteration 0 is the MSE of the median- or majority-imputed dataset. In many cases, the imputation appears to converge quickly with relatively few iterations. Generally, the RF-GAP proximities outperformed the other definitions at each number of iterations. For high percentages of missing values (at least 75%), or for small datasets, the random forest imputation does not always converge, and performance may actually decrease as the number of iterations increases. These results suggest that RF-GAP can be used to improve random forest imputation.

Random forests can also be used to detect outliers in the data. In the classification setting, outliers may be described as observations with measurements that significantly differ from those of other observations within the same class. In some cases, these outliers may be similar to observations in a different class, or they may distinguish themselves from observations in all known classes.
In either case, outlying observations may negatively impact the training of many classification and regression algorithms, although random forests themselves are rather robust to outliers in the feature variables [2]. Random forest proximity measures can be used to detect within-class outliers, as outliers are likely to have small proximity measures with other observations within the same class. Thus, small average proximity values within-class may be used as an outlier measure.

Fig. 4. Comparison of MDS embeddings using different RF proximity definitions. Proximities were constructed from a random forest trained on the two-class Sonar dataset (208 observations of 60 variables) from the UCI repository, which gave an OOB error rate of 15.87%. Multi-dimensional scaling (MDS) was applied to √(1 − prox) using (a) RF-GAP proximities, (b) the original proximities, (c) OOB proximities (Definition 2), (d) PBK [57], and (e) RFProxIH [59]. Using RF-GAP proximities, the visualization gives a good representation of the forest's classification problem. For correctly classified points (dots), there are two clear groupings, while misclassified points (squares) are generally located between the groupings or found within the opposite class's cluster, albeit closer to the decision boundary than not. The original definition, PBK, and RFProxIH over-exaggerate the separation between classes. This is apparent in panels (b), (d), and (e), as the two classes appear nearly linearly separable, which does not accurately depict the random forest's performance on the dataset. Using only OOB samples to generate the proximities improves upon those three but seems to add some noise to the visualization. There are still two major class clusters, but some correctly classified points are found farther inside the opposite class's cluster compared to the RF-GAP visualization.
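As a sketch, such a within-class outlier score can be computed from any proximity matrix as follows (Python; this follows Breiman's classical formulation, in which the raw score n / Σ_{j∈class(i)} prox²(i, j) is standardized within each class; the function name is illustrative):

```python
import numpy as np

def outlier_scores(prox, y):
    """Within-class outlier scores from a proximity matrix, following
    Breiman's classical measure: raw(i) = n divided by the sum of squared
    proximities to same-class points, then centered by the class median
    and scaled by the class mean absolute deviation."""
    y = np.asarray(y)
    n = len(y)
    prox = np.array(prox, dtype=float)
    np.fill_diagonal(prox, 0.0)  # exclude self-proximity
    raw = np.empty(n)
    for i in range(n):
        same = (y == y[i])
        same[i] = False
        denom = np.sum(prox[i, same] ** 2)
        raw[i] = n / denom if denom > 0 else np.inf
    scores = np.empty(n)
    for c in np.unique(y):
        idx = (y == c)
        med = np.median(raw[idx])
        mad = np.mean(np.abs(raw[idx] - med))  # mean absolute deviation
        scores[idx] = (raw[idx] - med) / (mad if mad > 0 else 1.0)
    return scores
```

An observation with small proximities to its own class receives a large raw score, and the class-wise standardization makes scores comparable across classes.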
We describe the algorithm as follows: (1) For each observation i, compute the raw outlier measure as n / Σ_{j ∈ class(i)} prox²(i, j). (2) Within each class, determine the median and mean absolute deviation of the raw scores. (3) Subtract the class median from each raw score and divide by the class mean absolute deviation to obtain the final outlier score.

In this paper, we presented a new definition of random forest proximities, called RF-GAP, that characterizes the random forest out-of-bag prediction results as a weighted nearest neighbor predictor. We proved that the performance of the proximity-weighted predictor exactly matches the out-of-bag prediction results of the trained random forest and also demonstrated this relationship empirically. Thus, RF-GAP proximities capture the random forest-learned data geometry, which provides empirical improvements over existing definitions in applications such as proximity-weighted prediction, missing data imputation, outlier detection, and visualization.

Additional random forest proximity applications can be explored in future work, including quantifying outlier detection performance and comparing against other, non-tree-based methods, assessing variable importance, and applying RF-GAP to multi-view learning. This last application shows much promise, as classification accuracy was greatly increased after combining proximities in [28], [49] using other definitions. An interesting visualization application was introduced in [31] which may see improvements using geometry-preserving proximities. Multi-view learning may also be paired with this approach in some domains to visualize and assess contributions from the various modes or to perform manifold alignment. An additional area of improvement is scalability. Our approach is useful for datasets with a few thousand observations, but we can expand its capabilities by implementing a sparse version of RF-GAP. Further adaptations may make random forest proximity applications accessible for even larger datasets.

Fig. A.1. MDS embeddings of the Ionosphere data from the UCI repository [61].
This binary classification problem predicts whether returned radar signals are representative of a structure (good) or not (bad). We see a similar pattern in these MDS embeddings as in Figure 4: the class separations are somewhat exaggerated for the Original, PBK, and RFProxIH proximities, while points clearly susceptible to misclassification are identifiable in the RF-GAP and OOB plots.

Here we present additional experimental results demonstrating that using RF-GAP leads to improved imputation, visualization, and outlier detection over the other random forest proximity measures.

We first provide additional examples of MDS applied to the various random forest proximities. In Figure A.1, we compare the plotted MDS embeddings on the Ionosphere data from the UCI repository [61]. It is clear from the images that the random forest's misclassified points are typically found on the border between the two class clusters in the RF-GAP embeddings, while this is not always the case for the other proximity measures. Additional figures (A.2, A.3, and A.4) display MDS applied to the proximities. In Figure A.2 we see patterns similar to those displayed in Figure 4: exaggerated separation in the Original and RFProxIH embeddings and excess noise in the OOB embeddings. RF-GAP seems to accurately portray why the misclassifications are made in the context of proximity-weighted predictions.

Figures A.4, A.5, and A.6 give additional examples of proximity-based outlier scores. Points that are farther from their respective class clusters can be viewed as outliers and are often misclassified. The point size is proportional to the outlier measure in the figures. This, however, may not be as clear when two or three points are far from their respective cluster but near to each other.

Here we show extended results on data imputation using random forest proximities.
Table A.1 shows the average imputation results over 100 trials across 16 datasets from the UCI repository [61] using four of the proximity measures with a single iteration. For each dataset, values were removed completely at random in amounts of 5%, 10%, 25%, 50%, and 75%. The PBK proximities were omitted from this study due to their high computational cost. Additionally, some datasets were not compatible with RFProxIH due to continuous responses or categorical features. The mean squared error (MSE) is almost universally lower when using RF-GAP for imputation. RF-GAP is only sometimes outperformed when the amount of missing values reaches 75%, and even in these cases, RF-GAP is always in second place. The Banknote, Ionosphere, Optical Digits, Parkinsons, and Waveform datasets provide particularly good examples of RF-GAP for imputation. Here, RF-GAP outperforms each of the other definitions and the error decreases monotonically.

Figures A.7, A.8, and A.9 show imputation results across multiple iterations. Each experiment was repeated 100 times across 15 iterations. In general, RF-GAP outperforms the other proximity-weighted imputations, although the imputation tends to be much noisier for smaller datasets (see Balance Scale, Ecoli, Iris, and Seeds, for example) and less reliable for large percentages of missing values. This is particularly prominent when 75% of the data is missing. In some of these cases, the error increases with the number of iterations, for example, in the Ecoli and Diabetes datasets.

Figs. A.4-A.6. Datasets from the UCI repository [61] with MDS applied to the random forest proximities. Here, point size is proportional to the outlier score provided by each method. In each case, observations with high outlier scores corresponded to misclassified points.

TABLE A.1. Complete results for data imputation using random forest proximities with a single iteration. For each considered dataset, values were removed completely at random in amounts of 5%, 10%, 25%, 50%, and 75%.
Missing values were imputed using the various proximity definitions (PBK was not used here for computational considerations). The experiment was repeated 100 times for all datasets. The two numbers directly below each dataset name indicate the number of observations and the number of variables, respectively. The averages of the mean squared errors (MSE) between the original and imputed values are recorded along with the standard errors. For missing value percentages up through 50%, RF-GAP proximities (Definition 3) provided the most accurate imputation across all datasets. At 75% missing values, RF-GAP proximities outperformed on 69% of the datasets; otherwise, the RF-GAP proximities are in second place. Note: some datasets were not compatible with RFProxIH due to continuous responses or categorical feature vectors. All data variables were scaled from 0 to 1 for comparability.

Figs. A.7-A.9. The scores were compared using 1 to 15 imputation iterations as described in Section 6.2, and each experiment was conducted over 10 repetitions using the original proximities, OOB proximities, RF-GAP proximities, and RFProxIH. Iteration 0 provides the MSE for the median-filled imputation. Four different percentages of values missing completely at random (MCAR) were used (5%, 10%, 25%, and 50%) across several datasets from the UCI repository [61]. In general, RF-GAP outperforms the other proximity-weighted imputations. See Table A.1 for additional results using a single iteration.
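As a sketch, the MDS pipeline used throughout these visualization experiments (metric MDS applied to √(1 − prox)) might look as follows in Python with scikit-learn. The Breast Cancer dataset and Breiman-style shared-leaf proximities are stand-ins for the paper's datasets and for RF-GAP, which scikit-learn does not provide:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import MDS

def proximity_matrix(rf, X):
    """Breiman-style proximity: the fraction of trees in which two
    observations share a terminal node (a stand-in for RF-GAP)."""
    leaves = rf.apply(X)  # shape (n_samples, n_trees)
    return np.mean(leaves[:, None, :] == leaves[None, :, :], axis=2)

# Any labeled dataset works here; Breast Cancer stands in for Sonar.
X, y = load_breast_cancer(return_X_y=True)
X, y = X[:150], y[:150]  # keep the example small

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
prox = proximity_matrix(rf, X)

# Metric MDS applied to sqrt(1 - prox), as in the figures above.
diss = np.sqrt(np.clip(1.0 - prox, 0.0, None))
np.fill_diagonal(diss, 0.0)
embedding = MDS(n_components=2, dissimilarity="precomputed",
                random_state=0).fit_transform(diss)
# embedding[:, 0] and embedding[:, 1] can then be plotted, colored by
# class and marked by whether each point was misclassified.
```

The square root is a common way to turn a similarity in [0, 1] into a dissimilarity that behaves more like a distance before embedding.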
Random forests
Random Forests
Solar radiation forecasting using artificial neural network and random forest methods: Application to normal beam, horizontal diffuse and global components
Comparison of random forest, artificial neural networks and support vector machine for intelligent diagnosis of rotating machinery
COVID-19 patient health prediction using boosted random forest algorithm
Spatio-temporal estimation of the daily cases of COVID-19 worldwide using random forest machine learning algorithm
Shallow landslide susceptibility mapping by random forest base classifier and its ensembles in a semi-arid region of Iran
Forecasting incidence of infectious diarrhea using random forest in Jiangsu Province, China
Study of cardiovascular disease prediction model based on random forest in eastern China
Predicting the deforestation probability using the binary logistic regression, random forest, ensemble rotational forest, REPTree: A case study at the Gumani River Basin, India
Prediction of nanofluids viscosity using random forest (RF) approach
2-stage modified random forest model for credit risk assessment of P2P network lending to "three rurals" borrowers
RF-PseU: A random forest predictor for RNA pseudouridine sites
Wearable sensors for activity analysis using SMO-based random forest over smart home and sports datasets
Estimation of the spatial distribution of heavy metal in agricultural soils using airborne hyperspectral imaging and random forest
Random forests
Visualizing data using t-SNE
Visualizing structure and transitions in high-dimensional biological data
Visualizing high dimensional dynamical processes
Extendable and invertible manifold learning with geometry regularized autoencoders
A global geometric framework for nonlinear dimensionality reduction
UMAP: Uniform manifold approximation and projection for dimension reduction
On spectral clustering: Analysis and an algorithm
Support-vector networks
Kernel principal component analysis
Distance metric learning using random forest for cytometry data (Conf. of the IEEE Engineering in Medicine and Biology Society (EMBC))
Random forest-based manifold learning for classification of imaging data in dementia
Tumor classification by tissue microarray profiling: random forest clustering applied to renal cell carcinoma
Pathway analysis using random forests classification and regression
Random forest-based diffusion information geometry for supervised visualization and data exploration
Random-forests-based network intrusion detection systems
Random forests for land cover classification
Accurate identification and detection of outliers in networks using group random forest methodology
Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: a comparative study
Predicting missing values: a comparative study on non-parametric approaches for imputation
Missing data imputation through the use of the random forest algorithm
Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study
Unsupervised learning with random forest predictors
Cerebrospinal fluid proteomic biomarkers for Alzheimer's disease
Outlier detection in sensed data using statistical learning models for IoT
Detecting outliers in species distribution data
Random forest as one-class classifier and infrared spectroscopy for food adulteration detection
The weirdest SDSS galaxies: results from an outlier detection algorithm
Accurate identification and detection of outliers in networks using group random forest methodology
mice: Multivariate imputation by chained equations in R
Gene selection using random forest and proximity differences criterion on DNA microarray data
Explicating feature contribution using random forest proximity distances
Random forest dissimilarity based multi-view learning for radiomics application
A random forest proximity matrix as a new measure for gene annotation
Propensity score and proximity matching using random forest
The elements of statistical learning: Data mining, inference, and prediction
Classification and regression by randomForest
ranger: A fast implementation of random forests for high dimensional data in C++ and R
Random Forests
R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing)
A novel approach to estimate proximity in a random forest: An exploratory study
The random forest kernel and other kernels for big data from random partitions
A novel random forest dissimilarity measure for multi-view learning
Random forests and adaptive nearest neighbors
UCI machine learning repository