ESWA2DOI.dvi Prediction of Survival Probabilities with Bayesian Decision Trees V. Schetinin, L. Jakaite Computer Science Department and Technology, University of Bedfordshire, UK W. Krzanowski College of Engineering, Mathematics and Physical Sciences, University of Exeter, UK Abstract Practitioners use Trauma and Injury Severity Score (TRISS) models for pre- dicting the survival probability of an injured patient. The accuracy of TRISS predictions is acceptable for patients with up to three typical injuries, but unacceptable for patients with a larger number of injuries or with atypi- cal injuries. Based on a regression model, the TRISS methodology does not provide the predictive density required for accurate assessment of risk. Moreover, the regression model is difficult to interpret. We therefore consider Bayesian inference for estimating the predictive distribution of survival. The inference is based on decision tree models which recursively split data along explanatory variables, and so practitioners can understand these models. We propose the Bayesian method for estimating the predictive density and show that it outperforms the TRISS method in terms of both goodness-of-fit and classification accuracy. The developed method has been made available for evaluation purposes as a stand-alone application. Keywords: Bayesian prediction, survival probability, Markov chain Monte Carlo, classification tree, trauma care. DOI: 10.1016/j.eswa.2013.04.009 1. Introduction For patients alive on arrival at a hospital, the probability of survival is calculated by using a logistic regression model that employs the Trauma and Injury Severity Score (TRISS) system [5, 4, 27, 12, 28]. TRISS-based models Preprint submitted to Expert Systems With Applications April 22, 2013 consider up to three most severe injuries which a patient can obtain in six regions of the body: head, face, chest, abdomen, extremities, and external (skin, subcutaneous tissue and burns). The model includes both continuous and categorical screening tests: the first type includes age, systolic blood pressure, and respiratory rate, while the second type includes the severity scores of injuries a patient can obtain in the six body regions, as well as Glasgow Coma Scale (GCS) and type of injury. The screening tests are performed on the patient’s arrival by a trained scorer. The calculation of survival probabilities has been made available online as a TRISS Calculator [7]. The screening tests are used to form two aggregated predictors, Injury Severity Score (ISS), and Revised Trauma Score (RTS). The use of such aggregated predictors has revealed unexplained fluctuations of ISS over ob- served (or actual) probabilities of survival, and some researchers have raised a concern about its predictive ability [38, 4, 28, 1, 39]. The TRISS determines the probability of survival, Ps, using the following logistic regression model: Ps = 1/(1 + e−b), (1) where b = b0 + b1 × RTS + b2 × ISS + b3 × A. Here b0, b1, b2, and b3 are the regression parameters, and A is is the dichotomised age: A = 0, if age < 55, and A = 1, otherwise. The regression parameters were defined for blunt and penetrating injuries in [5]. The analysis of actual survival against predicted probabilities is defined as calibration, and a visualized relationship between values of these probabilities is a calibration curve, see e.g. [22, 53]. An ideal calibration curve is a 45 degree line with zero intercept. In this light the calibration curve for TRISS models has been found to deviate significantly from the ideal [28, 44]. It has been shown that the accuracy of TRISS models can be improved by updating the model parameters on new data, see e.g. [8, 36]. This ap- proach can be implemented efficiently when an appropriate model structure is known. However in practice such a model is difficult to identify. When goodness or fitness of a model can be measured in terms of its likelihood, the model can be fitted to data with the likelihood maximization method. In practice, this approach requires much effort to overcome the optimization problems while the desired improvement cannot be guaranteed [16, 23, 2, 37]. Despite the problems, the accuracy of TRISS predictions has been found acceptable when the types and severities of patients injuries are typical. How- 2 ever, for cases with four or more injuries as well as with some atypical com- binations of injuries, the accuracy could be improved [8, 27, 28]. It is highly desirable for practitioners to estimate the uncertainty in pre- dictions of survival. In general, estimates of uncertainty are required to min- imize risks of mistaken decisions and, in particular, to estimate confidence intervals. These intervals can be accurately estimated when a predictive probability density is fully known, but the desired estimates cannot be pro- vided within a concept such as TRISS which employs a maximum likelihood method to fit a logistic regression to data [43, 1]. The above problems motivated us to develop a Bayesian method for pre- diction of survival. In our research we used the US National Trauma Data Bank (NTDB) which is the major data source of records of injured patients admitted to hospitals and emergency units; these data are available for re- search from [12]. The NTDB data include information about a patients age, gender, type and regions of injuries along with some clinical and background information about a patients state. The NTDB also includes information about TRISS prediction and outcome of care, alive or died, for each patient. In our research we use well-known Decision Tree (DT) models which are induced from given data in such a way that features which make a distin- guishable contribution to the model outcome are selected. These features make axis-parallel partitions, and as a result users find the DT models inter- pretable [6, 18]. We tested the proposed Bayesian method on a set of patients registered in the NTDB with multiple injuries. We expected that in such cases the proposed method would significantly outperform the TRISS method. How- ever, in this research we did not generalize our method to the entire NTDB population including about 2 millions of patients because such a general- ization would require much more intensive computational experiments. For evaluation of our method, we developed a Bayesian calculator of survival as a stand-alone application available from [51]. 2. Related Work In the related literature we found a number of machine learning and sim- ulation methods which are competitive to conventional statistical methods. Simulation was principally via Markov chain Monte Carlo (MCMC) meth- ods, while ML methods included artificial neural networks and support vector 3 machines which are well known for providing non-linear fitting of models to data [3, 56]. The goodness-of-fit or calibration of predictive models produced by these methods was measured in terms of least square error. As for conventional statistical methods, the performance was measured in term of the receiver operating characteristic curve (ROC), or more specifically, by the area under this curve (AUC), see e.g. [18, 31]. The performance was also assessed in terms of accuracy of classification or discrimination measured as the ability of a model to discriminate survived patients from ones that died [27]. 2.1. Machine Learning Methods The study by Sujin et al[55] compared the mortality prediction models built with different Machine Learning methods on a data set collected by the University of Kentucky Hospital. The authors of this study compared the ar- tificial neural networks, support vector machines, DT, and conventional logis- tic regression models. The data set used for the experiments included 38,474 patients, information about which was represented by 41 variables. It was reported that only 15 out of these variables made a significant contribution to the prediction, including blood pressure, respiration rate, GCS, comor- bidity, and blood serum composition. The best performance was achieved with the DT model, AUC = 0.892, whilst the AUC for the artificial neu- ral networks, support vector machines, and logistic regression models were 0.874, 0.876, and 0.871, respectively. All the methods were implemented in the SPSS Clementine software [24]. In [52], an artificial neural network-based approach was described. This approach was compared with a logistic regression model on 13,164 patients whose physiological information was represented by 17 variables. Both the neural networks and the regression models were built to provide non-linear fits to the data, and their goodness-of-fit (or calibration) was evaluated in terms of least square error. It was reported that the resultant neural network model slightly outperformed the logistic regression in terms of AUC. The authors of study [25] employed a probabilistic artificial neural net- work for predicting mortality in emergency room. The network with 10 input variables was trained with a genetic algorithm. The calibration was evalu- ated using Hosmer-Lemeshow statistic. The study conducted on a data set of 533 patients revealed that the performances of the neural network and logistic regression models were comparable. 4 In [40], such machine learning methods as Naive Bayesian classifier, DT, support vector machine and artificial neural network were used for predicting survival of burn patients. The authors reported that the prediction accuracies of these methods were comparable. All the methods were implemented in the Weka software [21]. The data were represented by 10 features, namely age, gender, and percentages of burn in the eight body regions. The DT was induced with the C4.5 algorithm [41]. The study by Clermont et al [11] compared an artificial neural network method with a logistic regression model for predicting mortality in the in- tensive care units. The study was carried out on 1,647 patient cases repre- sented by 24 input features. The features included patient’s age, the values of 16 physiologic variables including temperature, heart rate, blood pres- sure, respiratory rate, oxygenation, composition of the blood serum, GCS, as well as binary indicators for the absence or presence of chronic conditions. The goodness-of-fit was evaluated using Hosmer-Lemeshow statistics, and the comparisons were made in terms of classification accuracy as well as in terms of AUC value. It was shown that the both methods had similar performances. In [2], an artificial neural network was compared against a TRISS model on the UK Trauma data represented by 16 anatomical and physiological pre- dictor variables. Both models provided the similar accuracy of classification, but the TRISS model showed better performance in terms of AUC, 0.941 ver- sus 0.921. The neural network model provided a better calibration in terms of Hosmer-Lemeshow statistics, 58.3 versus 105.4. It was also found that the head injury, age, and chest injury made the most important contribution to the outcome, while respiration rate, heart rate, and systolic blood pressure were underestimated (their contribution was less important). The authors concluded that the TRISS model was adequate but not optimal. A Bayesian belief network described in [54] was built to predict morbidity and outcomes in wounded patients. Bayesian belief networks are known as graphical models capable of explaining relationships between predictor vari- ables; such models cannot be developed with conventional logistic regression methods. The study conducted on a data set of 22 patients revealed that the logistic regression outperformed the Bayesian belief networks in terms of AUC. The described Bayesian belief network was developed to estimate one of three types of outcomes: likelihood of infection, requirement for intensive care, and impaired wound healing. The relationships between clinical vari- ables and outcomes were identified using DecisionQ FasterAnalytics Bayesian 5 modelling software[13]. It was found that the likelihood of infection could be estimated using serum albumin, injury severity score, and initial requirement for blood transfusion. The likelihood of intensive care admission was esti- mated using the blood transfusion requirement, physiological variables and serum biomarkers. Impaired wound healing was estimated using the indi- cator of intensive care admission, serum biomarkers and estimated hospital length of stay. The network consisted of 12 nodes, with two-to-three states per node. 2.2. Simulation The authors of review [30] found that conventional modelling techniques used in critical care could be improved by simulation methods such as MCMC. They analysed the usefulness and limitations of applications of these methods presented in the related literature and concluded that simulation provided practitioners with additional information on risks. In [33], conventional and Bayesian logistic random effects regression mod- els were compared for predicting outcomes on a data set of 8,509 patients with Traumatic Brain Injury. The Bayesian method has been implemented in statistical packages such as WinBUGS [34], MLwiN [42], MCMCglmm [20], and SAS [46]. It was reported that both methods provided similar prediction accuracy. The results of the Bayesian method were critically dependant on the chosen priors as well as on the sampler’s settings which can affect the convergence of the Markov Chain if given inappropriately. 3. Bayesian Predictions When Bayesian inference is employed for predictions, it is typically as- sumed that there exist a number of models which can appropriately approx- imate the relationship between predictor variables and output variable (or outcome) observed in given data. Given models with parameters, we can fit them to the data. It is most often the case that none of the models describes the true relationship between input and output variables. However, we as- sume that averaging over the models could result in more accurate approxi- mation to the true relationship. The most efficient averaging over models is achieved within the Bayesian methodology. However, when Bayesian meth- ods are applied to real data, a number of problems are raised. One specific problem we address in the paper is associated with Bayesian averaging over hierarchical models, such as DTs. 6 When DT models are collected in an ensemble, for interpretation purpose a single DT model with Maximum a Posterior likelihood can be selected as described in [10]. However our technique proposed in [47] for interpretation of an ensemble of DT models has been capable of finding a single DT model providing a better accuracy of predicting survival probabilities. The methodology of Bayesian averaging over DT models has been made computationally feasible with the MCMC method [9, 14]. This method aims to explore a posterior density of model parameters by making random walk proposals. The desired density is approximated by drawing samples from areas with high posterior density of the model parameters (so-called areas of interest). In the case of DT models, a model parameter space is often of variable size, and the MCMC method is extended to Reversible Jump (RJ) [19]. The desired approximation is then achieved when the RJ MCMC algorithm can explore all areas of the posterior density in a model parameter space of a variable size. However, a posterior density function is often multimodal and the detailed exploration of the areas of interest cannot be achieved in a reasonable time, see e.g. [43]. This affects the accuracy of approximation as the Bayesian model averaging tends to act more as model selection [17, 14]. In practice, when DT models (and hence a model parameter space) are large, results of Bayesian averaging are critically dependent on prior informa- tion as shown in [14, 43]. When prior information is available, the averaging is mostly done over areas of high posterior, and the estimates of the desired predictive density are likely to be accurate. However, when prior information is absent, the areas of possible interest cannot be specified and hence detailed exploration may not be possible in a reasonable time [14, 45]. As a result, the MCMC sampler cannot explore all possible areas of interest proportionally to the posterior density of the parameters. One possible reason for the above disproportion is that the RJ MCMC algorithm tends to simulate samples from an oversized model parameter space [9, 14]. In our previous work we attempted to reduce factors that cause the RJ MCMC to sample from overgrown DT models and proposed a sweeping strategy [48]. In the experiments on the benchmark problems, we described in [49], this strategy has been shown more efficient than the restarting [9] and restricting [14] MCMC strategies. In these experiments the proposed strategy also outperformed the conventional random forest [15]. In our previous research, we also observed that when prior information on predictors was absent, the posterior density cannot be explored in detail, 7 and some DT models were disproportionally sampled [26]. When in the post-analysis phase we evaluated the frequencies of using predictor variables in the DT models, we found that some predictors were employed rarely. We assumed that these predictors made a weak contribution to the outcome. When we removed DT models which explored such weak predictors from the ensemble, we observed a decrease in entropy of the model mix. The fact of decreasing entropy has been proven as an indicator of improving estimates of the predictive density, see e.g. [32]. The above analysis motivated us to extend the methodology of Bayesian averaging over DT models for predicting survival probabilities. We attempt to improve the RJ MCMC method used for implementation of the Bayesian methodology. We are also interested in exploring the importance of the pre- dictive variables within the proposed method, and expect that the posterior information on predictors can be useful to optimize existing procedures of scoring injury severities. The methodology of Bayesian averaging over DT models is well developed and described in the literature (see e.g. [9, 14]). The details of the Bayesian method and the proposed MCMC strategy are described in the Appendix. In the following sections we explore the proposed strategy on a data set of patients from the NTDB. We attempt to improve the accuracy of estimates of predictive density. 4. Trauma Data We used data on patients registered in the NTDB who were alive on ar- rival at hospitals, and whose survival probabilities have been calculated with the TRISS method. Table 1 shows the screening tests denoted as variables x1, . . . , x17 which were used for the TRISS predictions. Variables x1 (age), x4 (blood pressure), and x5 (respiration rate) are continuous, and the others are categorical. The predicted output is the discharge status, y = {0, 1}. Figure 1 shows that the TRISS predictions and the observed survival probabilities progressively decrease with the number of injuries ranging from 1 to 20. The actual survival of patients with one injury was 0.98 while for patients with 20 injuries it was 0.71. The proportions of these patients were 0.17 and 0.0002, respectively. We see that the predicted values are below the actual frequencies and that the difference between their values, or prediction error, tends to increase with the number of injuries obtained by a patient. 8 Table 1: Screening Tests Test Name Range x1 Age 0-99 x2 Gender female, male x3 Injury type penetrating, blunt, burn x4 Blood pressure 0-299 x5 Respiration rate 0-59 x6 GCS Eye 1-5 x7 GCS Verbal 1-5 x8 GCS Motor 1-6 x9 Head severity 0-6 x10 Face severity 0-6 x11 Neck severity 0-6 x12 Thorax severity 0-6 x13 Spine severity 0-6 x14 Abdomen severity 0-6 x15 Upper extremity severity 0-6 x16 Lower extremity severity 0-6 x17 External severity 0-6 y Discharge status alive, dead The majority of patients were registered with 1 to 3 injuries and the TRISS predictions for this largest group of patients are close to the observed survival. The proportions of patients with a larger number of injuries are smaller and so the TRISS model does not fit these data well. In our research we attempt to improve the accuracy of predictions for such groups of pa- tients. In particular, we target a group which includes all the 14,840 patients registered in the NTDB with 11 to 15 injuries. This set does not include about 2% of patient’s records we found with one or more missing values as we did not attempt to fill the absent values in this research. Figure 2 shows the calibration curve of the TRISS model for this group of patients. We see that the observed probabilities are significantly higher than the predicted values. The difference is largest for patients with predicted survival between 0.1 to 0.5. We can evaluate quantitatively the goodness-of-fit (or calibration) of the 9 0 2 4 6 8 10 12 14 16 18 20 0.4 0.2 0.6 0.8 1 1.2 Number of injuries A ct u a l a n d T R IS S s u rv iv a l TRISS Prediction Actual Survival Figure 1: Observed and predicted survival probabilities for patients with differ- ent numbers of injuries. The TRISS predictions and the observed survival probabilities progressively decrease with the number of injuries ranging from 1 to 20. TRISS model for this group of patients by using the Hosmer-Lemeshow (HL) statistic used in the related literature [27, 4, 53]. When we split the data into 30 equally sized subsets, its value was 3680.5. 5. Experiments We used the above data to test the proposed MCMC simulation strat- egy. The experiments were run within 3-fold cross-validation. We compared the TRISS and Bayesian predictions in terms of classification accuracy, and Hosmer-Lemeshow statistic. The results were obtained using settings which allowed us to achieve a stationary distribution of the Markov chain and an efficient acceptance rate, while DT models were of a reasonable size. These settings are described in the Appendix. Figure 3 shows the calibration curve for the proposed Bayesian model. We can see that the Bayesian predictions are much closer to the observed survival rate than the TRISS predictions shown in Figure 2. The value of the Hosmer-Lemeshow statistic was significantly reduced from 3680.5 to 93.4. We compared the classification accuracies of the Bayesian and TRISS 10 Figure 2: Calibration curve for TRISS model for patients with 11-15 injuries. The observed probabilities of survival are significantly higher than the predicted values. The difference is largest for patients with predicted survival between 0.1 to 0.5. methods. Table 2 shows the classification accuracy (AC), true positive (TP), false negative (FN), true negative (TN), false positive (FP), sensitivity (SE), and specificity (SP) which were calculated by assigning the outcome alive if a patient’s survival prediction is higher than 0.5. We can observe that the accuracy of the Bayesian method is higher by 6%. Having a significantly higher FN rate (0.129 versus 0.024), the TRISS model provides the better TN rate (0.121 versus 0.078). Using the standard bootstrap method, we found that the p-values were less than 0.001 for all the statistical tests. Table 2 shows that the accuracy of the proposed Bayesian method is higher than that of the TRISS method by approx 6%. This part of classi- fication error is reducible, and this error appears when a decision threshold t used for classification differs from the optimal and thus becomes biased. Setting threshold t below the optimal value decreases the FP and increases the FN rates and, vice versa, setting the t above the optimal value increases the false positive and decreases the false negative rates. Therefore a value of t can be set so as to find a compromise between the SE and SP of a diag- 11 Figure 3: Calibration curve for the proposed Bayesian model. The Bayesian predictions are much closer to the observed survival rate than the TRISS predictions. nostic method, calculated as SE=TP/(TP+FN) and SP = TN/(TN+FP). The sensitivity, SE, shows how the diagnostic method is sensitive to posi- tive results, and the specificity, SP, shows how the method specifies negative results. In our case, positive results are associated with the conditions of a ”died” patient, and negative with those of an ”alive” patient. The sensitivity of the Bayesian method shown in Table 2 for the threshold t = 0.5 is 0.4868, that is less than the 0.7588 provided by the TRISS method. To achieve a similar sensitivity rate, we can increase the threshold t. For example, we can set t = 0.74 to increase the sensitivity to 0.7496, accepting a small decrease in the accuracy from 0.8936 to 0.8671, as shown in the third line of Table 2. In this case the differences between the values of confusion matrices for the Bayesian and TRISS methods remain significant with p-value < 0.001. We also tested the restarting MCMC strategy of sampling DT models [9, 14] on the same set of patients. The Hosmer-Lemeshow statistic was 17% higher on average than that for the proposed method. The classification accuracy was not significantly lower, however the uncertainty of estimates 12 Table 2: Performances of the TRISS and Bayesian Model (BM) with thresholds t Model AC TP FN TN FP SE SP TRISS 0.832 0.121 0.038 0.708 0.129 0.759 0.846 BM, t = 0.50 0.894 0.078 0.082 0.816 0.024 0.487 0.971 BM, t = 0.74 0.867 0.120 0.040 0.747 0.093 0.750 0.890 of the predictive density in terms of entropy of model mixing was signifi- cantly higher than that for the proposed method. This allows us to conclude that the proposed MCMC simulation strategy is capable of providing better conditions for Bayesian averaging over DT models. 6. Importance of Screening Tests The ensemble of DT models collected during MCMC simulation allows us to estimate the contribution of the predictive variables (screening tests) to the outcome. The importance of the variables can be estimated in terms of frequencies of using them in the ensemble. These frequencies (or posterior probabilities) are shown in Table 3. We can see that the most important contribution is made by the variables x4 (Blood pressure), x9 (Head severity), and x15 (Upper extremity severity). By contrast, the variables x5 (Respiration rate), x11 (Neck severity), and x17 (External severity) are least important, and therefore their contribution can be insignificant for predicting the survival of patients in the target group. 7. Bayesian Calculator of Survival Probabilities For evaluating the proposed Bayesian method, we developed a Calculator for predicting survival and tested it on the target data set described in Sec- tion 4. The Calculator allows the user to compare the Bayesian and TRISS predictions for a random patient drawn from the data set. The comparison is made in terms of prediction accuracy as described in Section 5. The user can also input new values of the screening tests to make predictions. It is important that the Calculator allows the user to estimate the pre- dictive probability density in order to assess the confidence intervals, which are associated with risk of making mistaken decisions. These estimates are 13 Table 3: Importance of Screening Tests Test Name Importance x4 Blood pressure 0.157 x9 Head severity 0.122 x15 Upper extremity severity 0.115 x1 Age 0.107 x13 Spine severity 0.093 x12 Thorax severity 0.081 x16 Lower extremity severity 0.064 x2 Gender 0.053 x10 Face severity 0.052 x8 GCS Motor 0.050 x14 Abdomen severity 0.045 x6 GCS Eye 0.018 x7 GCS Verbal 0.014 x3 Injury type 0.014 x5 Respiration rate 0.006 x11 Neck severity 0.004 x17 External severity 0.003 made individually for each patient, whilst the TRISS method is unable to provide such estimates. Figure 4 presents a screenshot of the calculator interface. The first column in the table Screening Tests shows the 17 screening tests that are described in Table 1. The second column shows the ranges of these tests. The third column displays values which the user can input or edit within the specified ranges. The graph Predicted Probabilities of Survival displays the probabilities of survival for a patient with the given screening tests. Each of the predicted probabilities can be interpreted as a hypothesis which is tested on the data set in the context of Bayesian inference. The bars on the graph show the observed probabilities of these hypotheses. The estimates of the predictive density shown in the graph provide all the information required to calculate the confidence intervals. Consider the example shown in Figure 4 for patient No 11311 with TRISS 14 Figure 4: Bayesian Calculator Screenshot. The table Screening Tests shows the 17 screening tests and the values of these tests which the user can input or edit within the specified ranges. The graph Predicted Probabilities of Survival shows the probabilities of survival for a patient. The estimates of the predictive density shown in the graph provide the information required to calculate the confidence intervals. survival probability 0.680 and outcome alive. For this patient, the Calcula- tor predicts a survival probability of 0.605. As this value exceeds 0.5, the predicted outcome is alive. The locations and heights of the bars shown in the graph present the estimates of predictive density. All the bars on the left from the 0.5 mark on the x-axis represent low probabilities of survival associated with the outcome died whereas all the bars on the right represent high probabilities of survival. Observing these probabilities, the user can analyse the risk for this patient. In this example, the sum over the first bars (on the left from 0.5) is smaller than the sum over the bars on the right. The substantial proportion of the former bars warns the user about a high death risk attached to this prediction. These bars make the predicted probability distribution wider and the uncertainty interval larger. In addition to a predicted probability of survival, the user can analyse the confidence interval calculated for a given patient. The lengths of the intervals can be associated with the difficulty of treatment for patients – the larger the interval, the more difficult is the case. We also observed some bars with unex- 15 pectedly low values of the observed probabilities, see the bar located around 0.51 in the graph, which is a case of a multimodal distribution. The user can conclude that patients with such predictions require special attention, and an investigation of similar cases could provide additional information about possible causes of the uncertainty. The Calculator can be downloaded from a web page [51] to be installed and run on a Windows XP 32-bit or Linux 64-bit machine The Bayesian risk assessments are computationally expensive, and so a high performance machine with a 64-bit processor and 4 GB memory is recommended. The Calculator can be used for evaluation purposes and allows users to observe both the predictive distribution and the calculated confidence intervals. 8. Discussion In the related publications reviewed in Section 2, the main Machine Learn- ing methods (Artificial Neural Networks, Support Vector Machines, and De- cision Trees) have not been found to outperform the TRISS method in terms of accuracy of predicting survival on the trauma data. In the Introduction we noted that the Machine Learning methods cannot provide an assessment of the predictive posterior distribution as proposed in our work. None of these methods is currently used in emergency care. It is important to note also that the use of Machine Learning methods (in particular Support Vec- tor Machines) requires careful and accurate setting of parameters (such as kernels) as well as of a proper model structure. The use of Artificial Neural Networks requires using a proper back-propagation algorithm and its learn- ing parameters. Finding proper settings typically requires the analysis of results of experiments run with different parameters of a machine Learning method and a model structure. However when such experiments are done, one might question whether the settings were explored within only a limited range of possible values. This is what we called the problem of likelihood maximization. In particular, we found that a single classification tree model [6], con- structed with the Matlab Statistics Toolbox [35], can provide only the fre- quentist estimation of survival, although its discrimination performance was comparable with that of the proposed method. As we discussed in the Introduction, when a model structure is defined properly we can use the likelihood maximization method to fit the model to a given data set. Optimal results will be guaranteed if the model likelihood 16 function is unimodal. However, often we do not know whether our model structure is given properly or whether its likelihood function is unimodal. A more practical scenario is when we have a few models, each of which provides a suitable goodness-of-fit in a particular area of the model parameter space. The use of Bayesian methodology in such cases allows us to average outcomes of the models according to their likelihood values and given prior information. The Bayesian average has been shown improving results even if there is no prior information on models and the so-called uniform (or flat) prior is used, see e.g. [14]. The improvement is, however, achievable if the models do not suffer from the overfitting problem as discussed in [17]. Therefore the average provides a practical way to combine prior information (if available) with a set of models which the user could consider competitive to a single model believed to be of a suitable structure. In this context, the strength of the Bayesian method is in averaging over possible models regardless of how much we know about a proper model structure and its parameters. When a single DT model is built, its likelihood is estimated for a new split made by an assigned feature. If the likelihood is increased, the new split is included in the model, and the assigned feature is assumed making a distinguishable contribution to the classification. Otherwise, the new split is unlikely included in the model as the splitting feature is unable to make a significant contribution. Therefore, a resultant DT model includes only fea- tures that make a significant contribution. The importance of these features cannot be evaluated within a single DT. However, the MCMC method allows us to generate DT models of different configurations, and so we can calculate the frequencies of using the features in these models; these frequencies allow us to estimate the desired feature importance. 9. Conclusion We analysed conventional TRISS-based regression models for predicting survival probabilities of injured patients. In the related literature we found firstly that the prediction accuracy of the TRISS method can be improved and secondly that such attempts have been made by employing the ML and simulation methods. However, we have not found evidences that any of these attempts significantly improved the TRISS method. Based on a regression model, the TRISS method cannot provide the esti- mates of predictive probability density that is required to evaluate confidence intervals. The ISS, which is used as an aggregated predictor for TRISS, is ob- 17 served with unexplainable fluctuations and so may be misleading. Moreover practitioners find that the TRISS models are difficult to interpret. This moti- vated us to explore Bayesian model averaging over DT models for predicting survival. In this paper, we analysed the implementation of the Bayesian method- ology with RJ MCMC simulation and found that in the burn-in simulation phase DT models tend to grow excessively. The existing MCMC strategies were unable to manage the excessive growth efficiently in terms of diversity of model mixing. Moreover these strategies require additional settings which have to be adjusted experimentally. To provide better conditions for detailed exploration of the posterior den- sity during the simulation, we proposed a sweeping strategy. This strategy was tested on a large set of patients registered in the NTDB with multi- ple injuries. The results showed that Bayesian model averaging significantly outperforms the TRISS-based model in terms of goodness-of-fit and classifi- cation accuracy. Moreover, the proposed strategy has slightly outperformed the conventional MCMC strategy. The use of DT models for the Bayesian averaging allowed us to estimate the contribution of each screening test to the outcome. This is a reasonable argument for practitioners to use the DT models discussed in Section 3 in the context of the transparency of models and their ability to select important explanatory variables. The above results allow us to conclude that the proposed method is ca- pable of improving the accuracy of predictions for survival of a patient with multiple injuries. The desired confidence intervals can be accurately esti- mated for each patient. Information about the importance of screening tests could be useful for cost analysis and for further improvement of the prediction accuracy. References [1] T. Bailey, R. Everson, J. Fieldsend, W. Krzanowski, D. Partridge, V. Schetinin, Representing classifier confidence in the safety critical do- main an illustration from mortality prediction in trauma cases, Neural Computing and Applications 16 (2007) 1–10. [2] D. Becalick, T. Coats, Comparison of artificial intelligence techniques with UK TRISS for estimating probability of survival after trauma, Journal of Trauma 51 (2001) 123–133. 18 [3] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2007. [4] O. Bouamra, A. Wrotchford, S. Hollis, A. Vail, M. Woodford, F. Lecky, A new approach to outcome prediction in trauma: A comparison with the TRISS model, Journal of Trauma 61 (2006) 701–710. [5] C.R. Boyd, M.A. Tolson, W.S. Copes, Evaluating trauma care: The TRISS method, Journal of Trauma 27 (1984) 370–378. [6] L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Re- gression Trees, Chapman and Hall, 1984. [7] K. Brohi, TRISS - Overview and desktop calculator, http://www.trauma.org/index.php/main/article/387/, 2012. [8] M. Chawda, F. Hildebrand, H. Pape, P. Giannoudis, Predicting outcome after multiple trauma: which scoring system?, Injury 35 (2004) 347–358. [9] H. Chipman, E. George, R. McCullock, Bayesian CART model search, Journal of American Statistics 93 (1998) 935–960. [10] H.A. Chipman, E.I. George, R.E. McCulloch, Bayesian treed models, Mach. Learn. 48 (2002) 299–320. [11] G. Clermont, D.C. Angus, S.M. DiRusso, M. Griffin, W.T. Linde- Zwirble, Predicting hospital mortality for patients in the intensive care unit: a comparison of artificial neural networks with logistic regression models, Critical Care Medicine 29 (2001) 291–296. [12] Committee on Trauma. American College of Surgeons, NTDB Version 7.2, http://www.facs.org/trauma/ntdb/ntdbapp.html, 2007. [13] DecisionQ Corporation, Turning data into decisions, http://www.decisionq.com, 2010. [14] D. Denison, C. Holmes, B. Mallick, A. Smith, Bayesian Methods for Nonlinear Classification and Regression, Wiley, 2002. [15] T.G. Dietterich, An experimental comparison of three methods for con- structing ensembles of decision trees: Bagging, boosting, and random- ization, Mach. Learn. 40 (2000) 139–157. 19 [16] S. DiRusso, T. Sullivan, C. Holly, S. Cuff, J. Savino, An artificial neural network as a model for prediction of survival in trauma patients: Vali- dation for a regional trauma area, Journal of Trauma 49 (2000) 220–223. [17] P. Domingos, Bayesian averaging of classifiers and the overfitting prob- lem, in: The 17th International Conference on Machine Learning, Mor- gan Kaufmann Publishers, 2000, pp. 223–230. [18] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley & Sons, Inc., New York, NY, USA, 2nd edition, 2001. [19] P.J. Green, Reversible jump Markov chain Monte Carlo and Bayesian model determination, Biometrika 82 (1995) 711–732. [20] J.D. Hadfield, MCMC methods for multi-response generalized linear mixed models: The MCMCglmm R package, Journal of Statistical Soft- ware 33 (2010) 1–22. [21] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Wit- ten, The weka data mining software: an update, SIGKDD Explorations Newsletter 11 (2009) 10–18. [22] J. Hilden, J.D. Habbema, B. Bjerregaard, The measurement of perfor- mance in probabilistic diagnosis, Methods of Information in Medicine 17 (1978) 227–237. [23] A. Hunter, L. Kennedy, J. Henry, I. Ferguson, Application of neural net- works and sensitivity analysis to improved prediction of trauma survival, Computer Methods and Programs in Biomedicine 62 (2000) 11–19. [24] IBM Corporation, SPSS software, http://www- 01.ibm.com/software/analytics/spss/, 2013. [25] F. Jaimes, J. Farbiarz, D. Alvarez, C. Martnez, Comparison between logistic regression and neural networks to predict death in patients with suspected sepsis in the emergency room, Critical Care 9 (2005) 150–156. [26] L. Jakaite, V. Schetinin, Feature selection for Bayesian evaluation of trauma death risk, in: The 14th Nordic-Baltic Conference on Biomedical Engineering and Medical Physics, Springer, 2008, pp. 123–126. 20 [27] P. Kilgo, J. Meredith, T. Osler, Incorporating recent advances to make the TRISS approach universally available, Journal of Trauma 60 (2006) 1002–1009. [28] P. Kilgo, J. Meredith, T. Osler, Injury severity scoring and outcomes research, in: D.V. Feliciano, K.L. Mattox, E.E. Moore (Eds.), Trauma (6th ed), New York, McGraw-Hill, 2008, pp. 223–230. [29] T. Koshy, Catalan Numbers with Applications, Oxford University Press, 2008. [30] J.E. Kreke, A.J. Schaefer, M.S. Roberts, Simulation and critical care modeling, Current Opinion in Critical Care 10 (2004) 395–398. [31] W.J. Krzanowski, D.J. Hand, ROC Curves for Continuous Data, Chap- man & Hall/CRC, 1st edition, 2009. [32] L.I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, John Wiley and Sons, Inc., 2004. [33] B. Li, H.F. Lingsma, E.W. Steyerberg, E. Lesaffre, Logistic random effects regression models: a comparison of statistical packages for binary and ordinal outcomes, Medical Research Methodology 23 (2011) 11–77. [34] D. Lunn, A. Thomas, N. Best, D. Spiegelhalter, WinBUGS – a Bayesian modelling framework: concepts, structure, and extensibility, Statistics and Computing 10 (2000) 325–337. [35] MathWorks, Inc, Matlab Statistics Toolbox, version 2012a, http://www.mathworks.co.uk/products/statistics/, 2012. [36] F. Millham, W. LaMorte, Factors associated with mortality in trauma: re-evaluation of the TRISS method using the National Trauma Data Bank, Journal of Trauma 56 (2004) 1090–1096. [37] J.S. Oakland, Statistical Process Control (5th edition), Butterworth- Heinemann, 2002. [38] T. Osler, R. F.B., G. Badger, M. Healey, D. Vane, S. Shackford, A sim- ple mathematical modification of TRISS markedly improves calibration, Journal of Trauma 53 (2002) 630–634. 21 [39] T. Osler, L. Glance, J. Buzas, D. Mukamel, J. Wagner, A. Dick, A trauma mortality prediction model based on the anatomic injury scale, Annals of Surgery 247 (2008) 1041–1048. [40] B.M. Patil, R.C. Joshi, D. Toshniwal, S. Biradar, A new approach: role of data mining in prediction of survival of burn patients, Journal of Medical Systems 35 (2011) 1531–1542. [41] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993. [42] J. Rasbash, C. Charlton, W.J. Browne, M. Healy, B. Cameron, ML- wiN. Version 2.02. Centre for multilevel modelling, University of Bristol, http://www.bristol.ac.uk/cmm/software/mlwin, 2005. [43] C. Robert, G. Casella, Monte Carlo Statistical Methods, Springer Texts in Statistics, Springer, 2004. [44] F. Rogers, T. Osler, M. Krasne, A. Rogers, E. Bradburn, J. Lee, D. Wu, N. McWilliams, M. Horst, Has TRISS become an anachronism? A com- parison of mortality between the National Trauma Data Bank and major trauma outcome study databases, Journal of Trauma and Acute Care Surgery 73 (2012) 326–331. [45] S. Rogers, M. Girolami, A First Course in Machine Learning, Chapman & Hall/CRC, 1st edition, 2011. [46] SAS/STAT, User guide 9.2. Introduction to Bayesian procedures. Centre for multilevel modelling, SAS Institute Inc., 2009. [47] V. Schetinin, J.E. Fieldsend, D. Partridge, T.J. Coats, W.J. Krzanowski, R.M. Everson, T.C. Bailey, A. Hernandez, Confident interpretation of Bayesian decision tree ensembles for clinical applications, IEEE Trans- actions on Information Technology in Biomedicine 11 (2007) 312–319. [48] V. Schetinin, J.E. Fieldsend, D. Partridge, W.J. Krzanowski, R.M. Ever- son, T.C. Bailey, The Bayesian decision tree technique with a sweeping strategy, in: The International Conference on Advances in Intelligent Systems, IEEE Computer Society, 2004. 22 [49] V. Schetinin, J.E. Fieldsend, D. Partridge, W.J. Krzanowski, R.M. Ev- erson, T.C. Bailey, A. Hernandez, Comparison of the Bayesian and ran- domized decision tree ensembles within an uncertainty envelope tech- nique, Journal of Mathematical Modelling and Algorithms 5 (2006) 397– 416. [50] V. Schetinin, L. Jakaite, Classification of newborn EEG maturity with Bayesian averaging over decision trees, Expert Systems with Applica- tions 39 (2012) 9340–9347. [51] V. Schetinin, L. Jakaite, W. Krzanowski, Bayesian cal- culator, Standalone application for Linux and Windows, http://www.traumacalc.org/bc, 2012. [52] A. Silva, P. Cortez, M.F. Santos, L. Gomes, J. Neves, Mortality assess- ment in intensive care units via adverse events using artificial neural networks, Artificial Intelligence in Medicine 36 (2006) 223–234. [53] E. Steyerberg, A. Vickers, N. Cook, T. Gerds, M. Gonen, N. Obu- chowski, M. Pencina, M. Kattan, Assessing the performance of predic- tion models: A framework for traditional and novel measures, Epidemi- ology 21 (2010) 128–138. [54] A. Stojadinovic, J. Eberhardt, T.S. Brown, J.S. Hawksworth, F. Gage, D.K. Tadaki, J.A. Forsberg, T.A. Davis, B.K. Potter, J.R. Dunne, E.A. Elster, Development of a Bayesian model to estimate health care out- comes in the severely wounded, Journal of Multidisciplinary Healthcare 16 (2010) 125–135. [55] K. Sujin, K. Woojae, W.P. Rae, A comparison of intensive care unit mortality prediction models through the use of data mining techniques, Healthcare Information Research 17 (2011) 232–243. [56] V. Vapnik, Statistical Learning Theory, Wiley, 1998. 23 Appendix. Bayesian Averaging over Decision Trees MCMC implementation of the Bayesian Method When averaging is made over DT models, the Bayesian formalism can be outlined as follows [14]. First we specify the parameters Θ of a DT model we aim to induce from the labelled data D represented by an m-dimensional input vector x. The outcome of a DT model is y = 1, . . . , C, where C ≥ 2 is the number of categories to one of which a DT model assigns a given input x. Given models M1, . . . , ML with parameters Θ1, . . . , ΘL, we can write the desired predictive distribution as an integral over the extended parameter vector Θ = (Θ1, . . . , ΘL): p(y|x, D) = ∫ Θ p(y|x, Θ)p(Θ|D)dΘ = L ∑ i=1 p(y|x, Θi)p(Θi|Mi, D)p(Mi), (2) were p(Mi) is the prior distribution of model Mi, p(Θi|Mi, D) is the posterior density of Θi given model Mi and data D, and p(y|x, Θi) is the posterior predictive density given the parameters Θi. The above integral is analytically tractable only in trivial cases when the distribution p(Θ|D) is known. In practice, we can estimate this distribution by drawing N random samples Θ(1), . . . , Θ(N) from the posterior distribution p(Θ|D), and then we can write: p(y|x, D) ≈ N ∑ i=1 p(y|x, Θ(i), D)p(Θ(i)|D) = 1 N N ∑ i=1 p(y|x, Θ(i), D). (3) The above approximation is achieved with the MCMC method of simula- tion or stochastic integration. The accurate approximation is achieved when a Markov chain becomes a random sequence with a stationary probability distribution. Then according to Eq. (3), we can draw the random samples and calculate the desired predictive density. Figure 5 shows an example of a DT model consisting of two splitting nodes, s1 and s2, and three terminal nodes t1, . . . , t3. The first node, s1, called the root, splits the entire data into two disjoint subsets so that data samples from one subset fall into node s2 via the left branch, and samples from the other subset fall into the terminal node t2 via the right branch. The node s2 further partitions the data samples which fall into the terminals t2 or 24 Figure 5: An example of DT model. The DT consists of two splitting nodes s1, s2 and three terminal nodes t1, . . . , t3. t3 via the left and right branches. Finally one of the terminal nodes assigns the given input to one of the given classes. In general, a binary DT with k terminals consists of (k − 1) splitting nodes, si, i = 1, . . . , (k − 1). The node si, has parameters including: the node position in the DT model, s p i , p = 1, . . . , (k − 1), an input variable svi , v = 1, . . . , m, and a threshold s q i , q ∈ (min(xv), max(xv)). The node si tests the vth variable against the threshold q and assigns the input x to the left branch if xv < q, or to the right one otherwise. A terminal node ti assigns the input x to class c with a probability P ci , i = 1, . . . , k. Consequently, a DT model is described by a vector of parameters, Θ, consisting of two parts. The first part includes the following parameters of nodes si: positions s p i , variables s v i , and thresholds s q i , i = 1, . . . , (k − 1). The second part includes the probabilities P ci , c = 1, . . . , C for each terminal node i, i = 1, . . . , k. DT models whose nodes split data into two disjoint subsets are called binary. The number of possible configurations of binary DTs with k terminal nodes, Sk, is determined by the Catalan number, see e.g. [29]. The number Sk grows exponentially with k and becomes very large for DTs with relatively small k. For example, for k = 25, Sk becomes a number 25 to the power of 12. In practice, to explain data we need to induce DT models of a reasonable size; the size of a DT model is defined by the number of its terminal nodes, k. Oversized DT models are difficult to interpret, and moreover they are prone to overfit data. The size of DT models is dependent on the number of data points, pmin, allowed to be in terminal nodes – setting a smaller pmin increases the size, while setting a greater pmin decreases the size. In most cases, prior informa- tion on the size of DT models is unavailable, and a suitable pmin has to be found empirically. In practice, the size of DT models is unknown or can be given within a range. In such cases, areas of interest (high posterior density of parameters Θ), which have to be explored in Eq. 3, are of variable size, and MCMC has to be extended to Reversible Jump (RJ) proposed in [19]. Prior information about input variables, such as importance of variables x1, . . . , xm, is also often unknown. In such cases, we can assign a variable v for the the node si to be drawn randomly from the uniform discrete distribution, v ∼ U(1, m). Similarly, a threshold q can be drawn from the uniform discrete distribution, q ∼ U(min(xv), max(xv)). It has been shown that the above priors are sufficient in order to build and explore DT models of different configurations within the RJ MCMC method [9, 14]. For binary DT models, the number of possible configurations, Sk, is defined by Eq. ??. From this equation, we see that the larger the k, the larger is the number Sk. So we expect that MCMC algorithm will explore possible DT configurations of size k with probabilities proportional to Sk. RJ MCMC for Averaging over DT Models The RJ MCMC method has been implemented for Bayesian averaging over DT models of variable size [9, 14, 50]. To explore DT models it has been proposed to use the birth, death, change-split, and change-rule moves made with Metropolis-Hastings (MH) sampler. The first two, birth and death, moves were proposed to reversibly change the number of nodes in a DT model (or the dimensionality of the model pa- rameter vector Θ). The third and fourth moves, change-split and change-rule, were aimed at changing the parameters Θ within a current dimensionality. The change-split move replaces a variable v in a chosen DT node si, while the change-rule move modifies a threshold q in node si. 26 The change-split moves are aimed at making large changes in the model parameters in order to potentially increase the chance of sampling from ar- eas of interest. Such moves are intended to disrupt a long sequence of the posterior samples drawn from a local area of interest. In contrast, the change-rule moves are aimed at making small changes in the parameters to let MCMC explore a surrounding area in detail. These moves are made more frequently than the others. The MH sampler starts with a DT consisting of one splitting node whose parameter Θ is assigned within the predefined priors. Making the above moves, the sampler attempts to grow the DT model to a reasonable size by fitting its parameters Θ to the data. The fitness or likelihood of DT models is gradually increased and then becomes oscillatory around some value. This phase, named the burn-in, has to be preset sufficiently long in order to achieve a stationary distribution of the Markov chain. When the Markov chain becomes stationary, the samples of the posterior distribution are collected to approximate the desired predictive distribution – this phase is called post burn-in. The above moves are made with the given proposal probabilities. Their values are dependent on the complexity of a classification problem – more complex problems require larger DT models. To grow such models, the proposal probabilities for the death and birth moves are set to larger values. In general, there is no guidance for setting proper parameters of the MH sampler, and their values have to be found empirically [9, 14, 50]. The proposed change is accepted according to the MH rule, see e.g. [14]: α = min ( 1, p(D|Θp)p(Θp) p(D|Θ)p(Θ) q(Θ|Θp) q(Θp|Θ) ) , (4) where p(D|Θp) and p(D|Θ) are the likelihoods of DT models with the pro- posed and current parameters Θp and Θ, respectively; p(Θp) and p(Θ) are the prior distributions of the parameters; q(Θp|Θ) is the conditional distri- butions of moving from the current parameter Θ to a proposed parameter Θp (so-called transition distribution); and q(Θ|Θp) is the density of the reverse transition. When the birth or death move changes a dimensionality of a DT model, the acceptance rule needs to count a proposal ratio, R. This ratio is de- pendent on the number of possible configurations of DT models, Sk, and so we need to count R to keep the Markov chain reversible during the MCMC 27 simulation. According to [14], the reversibility is kept when the following condition is met: p(Θp|D)q(Θ|Θp) = p(Θ|D)q(Θp|Θ). (5) The above density p(Θ|D) is p(Θ|D) = [ k−1 ∏ i 1 N(svi ) 1 m ] k Sk 1 K , (6) where N(svi ) is the total number of possible splitting rules for variable s v i ; K is the maximal number of terminals allowed for DT models induced from data. The transition distributions in Eq. 5 can be written as follows: q(Θp|Θ) = bk k 1 N(svi ) 1 m , (7) q(Θ|Θp) = dk+1 DQ , (8) where dk and bk are the proposal probabilities of death and birth moves, respectively, and DQ is the number of splitting nodes whose both branches are terminals. For the birth move, the Θp is a (k + 1)-dimensional vector, and therefore the reversibility is kept when Rb is Rb = p(Θp|D)q(Θ|Θp) p(Θ|D)q(Θp|Θ) , (9) The above definitions Eq. 7 and Eq. 8 allow us to rewrite the ratio Rb as follows: Rb = dk+1 bk k DQ+1 Sk Sk+1 . (10) Taking into account that DQ < k and Sk < Sk+1, we see that the ratio Rb ranges between 0 and 1 , Similarly, we can write the ratio Rd for the death move: Rd = bk dk−1 DQ k − 1 Sk Sk−1 . (11) 28 For the above DQ, the ratio Rb : Rb ≥ 1. We can see that the ratios Rb and Rd take different values when a model parameter space is of variable dimensionality. This allows the MH sampler to keep the desired reversibility and explore a parameter space proportionally to the numbers of configurations Sk. Problems of Sampling DT models DT models are multilevel hierarchical structures, as shown in Figure 5. Nodes located at a lower hierarchical level are strongly dependent on the pre- decessor nodes located at upper level. In such hierarchical structures, changes proposed by the MH sampler can significantly redistribute data points falling into DT terminal nodes. The change made in a node close to the DT root is most influential on the distribution. The changes in terminal nodes can be so significant that the likelihood of the DT model is decreased – the closer the node is to the root, the more significant is the change in distribution of data points. In most cases such proposals are rejected. In contrast, a change proposed in a node close to DT terminals is most likely to be accepted as such a change will insignificantly redistribute data samples in the DT terminals. As a result, the MH sampler will only explore limited configurations of DT models [9, 14]. Another problem occurs when the MH algorithm aims to sample large DT models. When a DT model is small and consists of a small number of terminal nodes, the number of data samples falling into the nodes is expected to be much larger than the given minimal number of points, pmin. However, when a DT has grown large, the number of data points is decreased so that further partitions become unavailable. This means that birth moves cannot be made until a death move merges two terminal nodes into one node. As a result the MH algorithm will sample a series of DT models with similar distributions of data samples over terminal nodes. Such series affect the diversity of samples from the posterior distribution and, therefore, the accuracy of approximation of the predictive distribution [14, 48]. Another negative effect is that unavailable moves degrade the given pro- posal probabilities of birth and change moves. When a move is unavailable, the MH algorithm will repeat the current sample, which reduces the diversity of model mixing [14]. In most cases, the number pmin is found from experiments – complex problems typically require a small pmin to allow growth of large DT mod- 29 els. However, an inappropriately small pmin leads to excessive growth of DT models. Growing a DT model, the MH algorithm makes birth moves and almost each birth move increases the likelihood of the model. The MH algorithm accepts these moves and the DT model grows rapidly. The growth of the model continues while the number of data samples in its terminal nodes exceeds pmin and the likelihood of a proposed model remains acceptable. During this period, the dimensionality of the DT model increases rapidly, and the sampler cannot explore the posterior within each dimensionality in detail. It is unlikely that samples will be drawn from areas of highest posterior density [9, 14]. The growth of DT models is typically monitored, and the modeller can reduce excessive growth by increasing pmin as well as by setting a smaller value of the proposal probability for the birth moves. To mitigate the negative effect of fast growing DT models, Chipman et al [9] have proposed a restarting strategy. This strategy allows a DT model to grow within a limited period in multiple runs. The average over all models grown in these runs produces a better approximation accuracy when the duration of the growth period and the number of the runs are properly set. A similar idea of restricting the growth of DT models has been proposed by Denison et al [14]. The growth is restricted within a given interval to allow the MH sampler to explore a model parameter space in detail. Both strategies require additional settings for the MH sampler, which have to be found experimentally. Sweeping Strategy As an alternative to the restricting strategies, the RJ MCMC method could be modified so as to reduce the number of replications of samples from the posterior density. In our previous work [48], we proposed a sweeping strategy aimed at reducing the number of unavailable moves. For making a change-split move, the sweeping strategy assigns a new variable xv, v ∼ U(1, m), and a threshold q: q ∼ U(a, b), (12) where U(a, b) is a uniform distribution on the interval between a = min(xv,j) and b = max(xv,j) defined by Np data points falling into the chosen node, where j = 1, . . . , Np. 30 For making change-rule moves, a new threshold q′ is drawn from a re- stricted Gaussian distribution: q′ ∼ N ′(q, σ2, a, b), (13) with mean q and given proposal variance σ2 on the interval (a, b). The proposed move can be made so that one or more terminal nodes in a DT model will contain fewer data points than pmin. If this happens in terminal nodes with a common parent node, these terminals are recombined into one terminal node, and the MH sampler counts such a move as a death move. If however there are two or more such terminals with different par- ents, the algorithm will assign the proposal unavailable in order to keep the reversibility of the Markov chain. Similarly to a change move, a birth move assigns a new splitting node with parameters drawn from the given prior. A new splitting variable xv is drawn from a uniform distribution, v ∼ U(1, m), and a new threshold q is assigned as described by Eq. 12. In our experiments, we observed that a MH sampler using the above prior on change moves proposes fewer unavailable moves and, therefore, the sampler accepts fewer replications of a current parameter vector Θ. Taking this into account, we hypothesise that a reduced number of the replications collected during the MCMC simulation will improve the diversity of model mixing. In support of this hypothesis, in our previous experiments [49] on the benchmark problems, we observed that the MH sampler using the above prior significantly reduced the dimensionality of parameter vector Θ as well as the uncertainty in estimates of predictive density. The above strategy, named sweeping in [48], is applied to the Markov chain in both burn-in and post burn-in phases. Let a MH sampler make the birth, death, and change moves as described in Section 9. Then we can describe the main steps of the sweeping strategy as follow. • Birth move: 1. Select a random terminal node i ∼ U(1, k) and count the number of data samples, p, in this node 2. If p > 2pmin then assign a variable, v ∼ U(1, m) and threshold given by Eq. 12 to a new splitting node 31 3. Count the numbers of data samples, p1 and p2, split by the new node 4. If (p1 ≥ pmin)&(p2 ≥ pmin) let the MH sampler check the accep- tance of the proposal • Change move (change-split or change-rule): 1. Select a random splitting node i ∼ U(1, k−1) and read its variable v and threshold q 2. For change-split assign a new variable, v′ ∼ U(1, m) 3. For change-rule assign a new threshold q′ defined by Eq. 13 4. Apply the proposed change to the DT 5. Count the number of terminal nodes, n0, with pi < pmin 6. If n0 == 1, then apply the death move 7. If n0 > 1, then assign the proposal unavailable and draw a new sample 8. Let MH sampler check acceptance of the proposal Here, n0 = ∑k i I(pi ≤ pmin), where I(·) = { 1 if pi ≤ pmin, 0 otherwise. Setting for the Metropolis-Hastings Sampler The settings include the following parameters of the sampler: 1. the proposal probabilities for birth, death, change-split, and change- rule moves, Pr 2. the proposal distribution, a Gaussian distribution with the zero mean and standard deviation s 3. the numbers of burn-in and post burn-in samples, nb and np, respec- tively 4. the sampling rate of the Markov chain, sr 5. the minimal number of data points allowed in terminal nodes, pmin The parameters Pr, s, and nb were the most significant factors that im- pact on the convergence of the Markov chain, and so we tested a number of variants of these parameters to achieve an acceptable convergence. The sampling rate sr was used to elevate the independence of samples drawn from the Markov chain during the post burn-in phase. At the second stage, the 32 settings s and pmin have been refined to let the MCMC algorithm efficiently sample the posterior distribution of Θ keeping the size of DT models rea- sonably small; the efficient sampling is achieved when the acceptance rate ranges between 0.25 and 0.5. The use of such a two-stage technique allowed us to reduce the number of possible combinations of the settings to a realistic number not exceeding 10. The best results were obtained with the following settings. The proba- bilities for the birth, death, change-split and change-rule moves were Pr = (0.2, 0.2, 0.1, 0.5), respectively. The proposal distribution was a Gaussian with s = 1.0. The numbers of samples were nb = 20, 000 and np = 5, 000, and the number of data points was pmin = 10. 33