Evaluation of machine learning algorithms for Health and Wellness applications: a tutorial

Jussi Tohka, Mark van Gils

2020-08-31

Research on decision support applications in healthcare, such as those related to diagnosis, prediction, and treatment planning, has seen enormously increased interest recently. This development is thanks to the increase in data availability as well as to advances in artificial intelligence and machine learning research. Highly promising research examples are published daily. At the same time, however, there are unrealistic expectations with regard to the requirements for reliable development and the objective validation that is needed in healthcare settings. These expectations may lead to unmet schedules and disappointments (or non-uptake) on the end-user side. It is the aim of this tutorial to provide practical guidance on how to assess performance reliably and efficiently and to avoid common traps. Instead of giving a list of dos and don'ts, this tutorial tries to build a better understanding behind these dos and don'ts and presents both the most relevant performance evaluation criteria and how to compute them. Along the way, we indicate common mistakes and provide references that discuss various topics in more depth.

Data-driven approaches for healthcare decision support, such as those driven by machine learning (ML), have seen a surge in interest over recent years, partly driven by the promising results that a 'reborn' AI research branch has generated. As the name says, these approaches rely on the availability of data to extract knowledge and train algorithms. This is opposed to, e.g., modeling approaches in which physiological, physics-based, mathematical, and other equations form the basis of algorithms, or rule-based systems in which reasoning processes are obtained by translating domain experts' knowledge into computer-based rules. Focusing on data-driven systems, the data plays a role in several components during the development and actual usage phases. First, we need data to extract knowledge from, i.e., to develop and train algorithms so that they learn by example the properties of the problem at hand and get better at solving the problem as more example data is provided. Second, we need to monitor during the development phase how promising the algorithms are and make choices, e.g., concerning the optimisation of parameters or the choice between different ML paradigms. Methods that do not perform well at all can be discarded, and ones that seem promising can be further optimised. To assess how promising a specific method is, we need to examine how it performs on data that was not used during training. Finally, to objectively assess how well the final 'best' system performs, we need to apply completely new data to it that has not been used at all thus far during the research and development process. Thus, there are at least three stakeholders that have an interest in getting as large a part of the data pie as possible: 1) the algorithm developer, who wants the method development to be as good as possible; 2) the person responsible for validation assessment, who wants to steer the development process as well as possible; and 3) the decision-maker, who wants an assessment of the merits of the system that is as accurate as possible.
Thus, we need to make a trade-off and think about efficient usage of the precious data, and we need to carefully define what we truly mean by performance evaluation and what kind of measures or yardsticks are appropriate when. It is the aim of this paper to provide practical guidance on how to assess performance reliably and efficiently and to avoid common traps. We concentrate on the challenges of decision support in the health domain, based on over 25 years of experience in the field.

Decision support in healthcare can mean many things. Perhaps the most common example is that of computer-based systems helping doctors to diagnose a disease based on, e.g., medical images. This is a classification task (diseased vs. healthy case, or disease A vs. disease B). In this case, the users are healthcare professionals (medical doctors or radiologists). Other common decision-making tasks include risk assessment (the risk of developing a disease), predicting hospital resource needs or treatment outcome, helping to plan interventions (surgery or treatment plans), and monitoring a patient's state over time to see whether a treatment is successful. From a data science point of view, there are many paradigms that can contribute to decision support tools. Typical approaches include classification methods, regression methods, and more explorative clustering approaches. In this paper, we concentrate primarily on classification and regression approaches as they are an essential part of most decision support systems in practice. Clustering approaches are often more related to research itself, often involving visualisation, and accurately quantifying their 'performance' in terms of numbers is often of less relevance.

Next to performance per se, there are many other criteria that influence whether an algorithm will find successful uptake in healthcare practice. Possible criteria to fulfill in order to have successful uptake of methods in practice include:
• performance (classification) accuracy,
• usability,
• ease of integration in existing processes,
• robustness (e.g., dealing with missing and poor-quality data),
• explainability (no black box),
• cost-effectiveness.
In this paper, we concentrate especially on the performance measures, due to their crucial role in, e.g., diagnostics, risk assessment, and treatment planning. However, the other criteria are crucial as well and would deserve separate discussion in dedicated tutorials.

This section reviews the background on supervised classification from a more formal viewpoint and introduces the notation. For clarity, we will focus on classification problems, but the treatment of regression problems would be very similar. For this tutorial, this section sets the scene by introducing the Bayes classifier, exemplifies how the Bayes classifier is approximated, illustrates how the Bayes classification rule depends on the prior probability of classes, and informs about sampling issues when training discriminative classifiers.

A classification task is to assign an object (in health applications, usually a person) described by a feature vector $x = [x_1, \ldots, x_d]$ (e.g., measured blood pressure levels, total cholesterol, age, sex) to one of the $c$ classes (e.g., cardiovascular disease in the future or not). The $c$ classes are denoted simply as $1, \ldots, c$. In machine learning, a classifier is represented by a function $\alpha$ that takes as an input a feature vector and outputs the class.
In more practical terms, a classifier outputs $c$ values, one for each class, and the class with the highest (or lowest) value is the one selected. In supervised learning, the function $\alpha$ is constructed based on training data, which is given by pairs $(x_i, y_i)$, where $x_i$ is the feature vector of an object belonging to the class $y_i \in \{1, \ldots, c\}$. We assume to have $N$ such pairs, constituting the training data $\{(x_i, y_i) \mid i = 1, \ldots, N\}$.

The classification problem is statistical by nature: two objects having equal feature vectors can belong to different classes. Therefore, it is not possible to derive a classifier that works perfectly, i.e., always classifies every object correctly. For a specified classification problem, the task is to derive a classifier that makes as few errors (misclassifications) as possible, among all possible objects to be classified. This is known as the Bayes classifier. The Bayes classifier is a theoretical construct useful for studying the classification problem, but it needs a complete and accurate statistical characterisation of the problem. In practice, this characterisation is not known and the classifier must be learned from training data. However, nearly all practical classifiers approximate the Bayes classifier in some sense.

To build the Bayes classifier, we assume to know
1. the prior probabilities $P(1), \ldots, P(c)$ of the classes and
2. the class conditional probability density functions (pdfs) $p(x|1), \ldots, p(x|c)$.
The prior probability $P(j)$ defines what percentage of all objects belongs to the class $j$. The class conditional pdf $p(x|j)$ defines the pdf of the feature vectors belonging to $j$. Obviously, $\sum_{j=1}^{c} P(j) = 1$. The Bayes classifier is defined as

$\alpha_{Bayes}(x) = \arg\max_{j} P(j|x),$

where $P(j|x)$ is the posterior probability of the class $j$ being the correct class for the object with the feature vector $x$. In other words, the Bayes classifier selects the most probable class when the observed feature vector is $x$. The posterior probability $P(j|x)$ is evaluated based on the Bayes rule, i.e., $P(j|x) = p(x|j)P(j)/p(x)$. However, $p(x)$ is equal for all classes and it can be dropped. The Bayes classifier can now be rewritten as

$\alpha_{Bayes}(x) = \arg\max_{j} p(x|j)P(j),$

i.e., the Bayes classifier computes the product of the class conditional density at $x$ and the prior for class $j$. By its definition, the Bayes classifier minimizes the conditional error $E(\alpha(x)|x) = 1 - P(\alpha(x)|x)$ for all $x$. Because of this and basic properties of integrals, the Bayes classifier minimizes the classification error

$E(\alpha) = \int E(\alpha(x)|x)\, p(x)\, dx.$

In the above equation, it is important to note that the integration is over all possible feature vectors, not just those in the training set. This is the generalization error that we are interested in estimating in the coming sections. The classification error $E(\alpha_{Bayes})$ of the Bayes classifier is called the Bayes error. It is the smallest possible classification error for a fixed classification problem. As mentioned previously, the Bayes classifier is a theoretical construct: in practice we never know the class conditional densities $p(x|j)$ or the class priors $P(j)$. We remark that the definition of the Bayes classifier does not require the assumption that the class conditional pdfs are Gaussian; they can be any proper pdfs.

It is important to understand the role of the prior probabilities when designing classification rules or considering the strength of evidence. This amounts to understanding the Bayes formula.
We give here a brief example, which is summarized from two online tutorials (http://yudkowsky.net/rational/bayes and https://betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/), but similar examples appear in various statistics textbooks. The example is based on the following scenario:
• 1% of women at age forty who participate in routine screening have breast cancer, i.e., P(Cancer+) = 0.01, and therefore 99% do not, i.e., P(Cancer-) = 0.99;
• 80% of mammograms detect breast cancer when it is there (and therefore 20% miss it), i.e., P(Test+|Cancer+) = 0.80 and P(Test-|Cancer+) = 0.20;
• 9.6% of mammograms detect breast cancer when it is not there (and therefore 90.4% correctly return a negative result), i.e., P(Test+|Cancer-) = 0.096 and P(Test-|Cancer-) = 0.904.
A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer? The correct answer is 7.8%, which can be quite counter-intuitive at first sight. The answer is obtained based on the Bayes rule. First, note that the question asks for the posterior probability of breast cancer given that the test was positive, P(Cancer+|Test+). This probability was not provided above and it must be computed based on the Bayes rule:

$P(Cancer+ \mid Test+) = \frac{P(Test+ \mid Cancer+)\, P(Cancer+)}{P(Test+)}.$

Note that P(Test+) = P(Test+|Cancer+)P(Cancer+) + P(Test+|Cancer-)P(Cancer-).
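As a quick check of the arithmetic, the 7.8% figure in the mammography example can be reproduced with a few lines of Python. This is a minimal sketch using only the numbers given in the scenario above; it is not part of the original tutorial's code.

```python
# Posterior probability of cancer given a positive mammogram, via the Bayes rule.
p_cancer = 0.01                  # P(Cancer+), prevalence in the screened population
p_pos_given_cancer = 0.80        # P(Test+ | Cancer+), sensitivity of the test
p_pos_given_healthy = 0.096      # P(Test+ | Cancer-), false positive rate

# Total probability of a positive test (law of total probability).
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1.0 - p_cancer)

# Bayes rule: P(Cancer+ | Test+) = P(Test+ | Cancer+) P(Cancer+) / P(Test+)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(f"P(Cancer+ | Test+) = {p_cancer_given_pos:.3f}")   # approximately 0.078
```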
The training data may be sampled in two distinct ways, and it is important to make a distinction between these. In mixture (or random) sampling, the training data is collected for all classes simultaneously, conserving the class proportions occurring in real life. In separate sampling, the training data for each class is collected separately. For classifier training, the difference between these two sampling techniques is that with mixture sampling we can estimate the priors $P(1), \ldots, P(c)$ as $\hat{P}(j) = n_j/N$, where $n_j$ is the number of samples from the class $j$. On the other hand, the prior probabilities cannot be deduced under separate sampling. Most sampling in biomedicine and the life sciences is separate sampling, whereas most standard textbooks assume mixture sampling. A good overview of the challenges caused by separate sampling is [1].

This subsection briefly explains how to construct classifiers based on the training data $\{(x_i, y_i) \mid i = 1, \ldots, N\}$. We can approach this problem in several ways. The conceptually simplest is via generative plug-in classifiers. These approximate the prior probabilities $P(j)$ and the class conditional densities $p(x|j)$ by estimates $\hat{P}(j)$ and $\hat{p}(x|j)$ found based on the training data, and substitute these estimates into the formula of the Bayes classifier. The estimates for the class conditional pdfs can be either parametric (e.g., naive Gaussian Bayes, discriminant analysis) or non-parametric (Parzen densities, mixture models). These classifiers are called generative because they build a probabilistic model of the classification problem that can be used to generate data.

As an example, we consider the Gaussian Naive Bayes (GNB) classifier for a two-class classification problem. The training data is $\{(x_i, y_i) \mid i = 1, \ldots, n\}$, where each $y_i$ is either 1 or 2. For convenience, we define $D_1 = \{i \mid y_i = 1\}$ and $D_2 = \{i \mid y_i = 2\}$, the indexes of the training samples belonging to class 1 and class 2, respectively. Moreover, let $n_1$ and $n_2$ be the number of samples in classes 1 and 2, so that $n = n_1 + n_2$. For GNB, we make the assumptions that 1) the data in each class is Gaussian distributed and 2) each feature is independent of every other feature. Denote $x_i = [x_{i1}, \ldots, x_{id}]$, where $d$ is the number of features. The classifier training consists of (a code sketch of these steps follows below):
1. computing the mean for each feature $k = 1, \ldots, d$ and each class $j = 1, 2$: $m_{jk} = (1/n_j) \sum_{i: y_i = j} x_{ik}$;
2. computing the variance for each feature $k = 1, \ldots, d$ and each class $j = 1, 2$: $s_{jk} = (1/n_j) \sum_{i: y_i = j} (x_{ik} - m_{jk})^2$;
3. computing the estimates $\hat{P}(1) = n_1/n$ and $\hat{P}(2) = n_2/n$.
After these computations, the class for a test sample $z = [z_1, z_2, \ldots, z_d]$ can be found by computing two discriminant values, $\left[\prod_{k=1}^{d} G(z_k; m_{jk}, s_{jk})\right] n_j/n$ for $j = 1, 2$, and picking the class producing the larger discriminant. The function $G(\cdot; m, s)$ denotes the Gaussian probability density with mean $m$ and variance $s$, i.e., $G(z; m, s) = \frac{1}{\sqrt{2\pi s}} e^{-(z-m)^2/(2s)}$.

Most modern learning algorithms construct discriminative classifiers. These do not aim to construct a generative model of the classification problem, but try to more or less directly find a classification model that minimizes the number of misclassifications in the training data (see [2] for a more elaborate definition of a discriminative classifier). For example, classifiers such as support vector machines, Random Forests, and neural networks belong to this class of classification algorithms. During the learning process, the classifier is defined by a set of parameter values $w$, i.e., the classifier is a function $\alpha(x; w)$, where the parameter vector $w$ is to be learned during the training. The training is typically done by optimizing a cost function (or loss function) $f$ that is somehow related to the desired optimality criterion. To give a simple example, consider a two-class problem where the classes are 0 and 1. Then, a simple but widely used loss function is the L2-loss, defined as

$f(w) = \sum_{i=1}^{N} (\alpha(x_i; w) - y_i)^2.$

This loss function approximates the number of misclassifications in the training set. The number of misclassifications itself cannot be used as a loss function, as it is not continuous and is thus intractable to optimize. Discriminative classifiers are powerful and often preferred over generative ones. However, their formulation poses two related problems in biomedical and life science applications, where the sizes of training sets are typically small. First, it may not be clear which optimality criterion the loss function approximates and whether the loss function approximates the correct optimality criterion. Second, in the case of separate sampling, it must be taken into account that the prior probabilities might not be correct [1]. These two problems are not that severe with generative classifiers.
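To make the GNB training steps listed above concrete, the following minimal Python sketch implements them with NumPy. The two-class data is simulated only for illustration, and the helper names (train_gnb, predict_gnb) are ours, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder training data: two Gaussian classes (labels 1 and 2), d = 3 features.
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 3)),
               rng.normal(1.0, 1.0, size=(60, 3))])
y = np.array([1] * 40 + [2] * 60)

def train_gnb(X, y):
    """Estimate per-class, per-feature means and variances plus class priors."""
    params = {}
    for j in np.unique(y):
        Xj = X[y == j]
        params[j] = (Xj.mean(axis=0),            # m_jk
                     Xj.var(axis=0),             # s_jk (maximum-likelihood estimate)
                     Xj.shape[0] / X.shape[0])   # hat P(j) = n_j / n
    return params

def gaussian(z, m, s):
    """Univariate Gaussian density G(z; m, s) with mean m and variance s."""
    return np.exp(-(z - m) ** 2 / (2 * s)) / np.sqrt(2 * np.pi * s)

def predict_gnb(params, z):
    """Pick the class with the larger discriminant prod_k G(z_k; m_jk, s_jk) * P(j)."""
    discriminants = {j: np.prod(gaussian(z, m, s)) * prior
                     for j, (m, s, prior) in params.items()}
    return max(discriminants, key=discriminants.get)

params = train_gnb(X, y)
print(predict_gnb(params, np.array([0.2, -0.1, 0.3])))   # most likely class 1
```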
The performance of classifiers can be measured in many ways. All of these ways are related to the number of times the classifier was 'correct' or 'wrong' when processing new inputs in real-life usage. However, different usage scenarios introduce different views on what we mean by 'correct' or 'wrong', and hence we have a range of different performance measures. Performance estimates can rarely be used directly as optimization criteria for discriminative classifiers, as they are neither continuous nor differentiable and thus cannot be efficiently optimized. However, as stated in Section 2.5, the loss function should approximate the desired performance measure.

The simplest way of assessing performance is by counting the number of errors, or correct classifications, that are made on a given set of data. It is clear that if the number of errors is large, the performance is not good. The accuracy can be defined as a simple ratio:

accuracy = (number of correct classifications) / (total number of samples to classify).

A high accuracy (close to 1, or 100%) might indicate that we have a useful classifier. However, this is not always certain, especially not if the prevalences of the different groups (e.g., diseased vs. healthy cases) are imbalanced. If, for example, we have a test set of 1000 cases, with 1 case being a 'disease case' and 999 being 'healthy cases', then we can make a simple classifier that always classifies any case as a 'healthy case'. It would have an impressive accuracy of 999/1000 = 99.9%, but be useless in practice, since its ability to correctly detect actual disease cases (which we call sensitivity) is zero. For this reason, we advance from the simple accuracy measure to a more in-depth look at how well disease (and healthy) cases are classified separately.

The confusion matrix is a useful tool for quantifying the performance of a classifier on different classes. By convention, the rows of the matrix contain the ground truths and the columns the classification results. Matrix element (i, j) reflects the number of cases that belong to class i and were classified as belonging to class j. An example is given in Table 1. In this case we have a two-class problem (class 1 = 'healthy', class 2 = 'disease'), and accordingly a 2x2 matrix. The rows contain the true classes, 'actual healthy' and 'actual disease', as ground truth (e.g., as observed from a confirmed clinical endpoint/diagnosis), and the columns contain 'classified healthy' and 'classified disease'. We can see that there were 116 + 5 = 121 healthy cases in the set; 116 of them were correctly classified as healthy (we call these True Negatives (TN)), and 5 erroneously as disease (false alarms, False Positives (FP)). Also, there were 12 + 23 = 35 disease cases; 23 of them were correctly classified (True Positives (TP)) and 12 were wrongly assessed as belonging to the healthy group (False Negatives (FN)). A perfect classifier would have all non-zero values on the diagonal, and zeroes everywhere else. We can see that the classifier overall works quite well, but especially the detection of disease cases (12 out of 35 classified wrongly) is not so great. Exploring the confusion matrix is thus highly useful for understanding what kinds of errors are being made on which classes. We can derive several simple quantities from the matrix that succinctly capture its main properties.

3.3. Classification rate, sensitivity and specificity, NPV, PPV, precision, recall, and F1 score

The performance of classifiers can be quantified by the following measures, see also Table 2. All of them have values in the interval [0, 1].

accuracy: the total number of correctly classified cases divided by the total number of cases in the test set, as in Eq. (5). In our example it is 0.891.

sensitivity: the number of disease cases (by convention called the positive class) that were correctly classified, divided by the total number of disease cases, i.e., $TP/(TP+FN)$. It thus quantifies how well the classifier is able to detect disease cases in the disease population, and thus appropriately raises an alarm. A low sensitivity implies that many disease cases are missed.
In our case the sensitivity is 0.657.

specificity: the number of healthy cases (by convention called the negative class) that were correctly classified as such, divided by the total number of healthy cases, i.e., $TN/(TN+FP)$. It thus quantifies how well the classifier is able to detect healthy cases in the healthy population, and thus appropriately stays quiet. A low specificity implies that many healthy cases are classified as disease cases, and many false alarms are generated. In our case the specificity is 0.959.

positive predictive value (PPV): the number of cases that actually have a disease divided by the number of cases that the classifier classifies as having a disease, i.e., $TP/(TP+FP)$. It is thus a probability-related measure that indicates how probable it is that a case has a disease when the classifier outputs the positive/disease class. Or, more popularly, how much one should believe the classifier when it indicates that the person has a disease. In our case it is 0.821.

negative predictive value (NPV): the number of cases that actually are healthy divided by the number of cases that the classifier classifies as being healthy, i.e., $TN/(TN+FN)$. It is thus a probability-related measure that indicates how probable it is that a person is healthy when the classifier outputs the negative/healthy class. Or, more popularly, how much one should believe the classifier when it indicates that the person is healthy. In our case it is 0.906.

Sensitivity and specificity are perhaps the most common measures in clinical tests and wider healthcare contexts, when we talk about the performance of (diagnostic) tests or patient monitoring settings. In a wider application area, the following measures are also used.

Precision: the number of cases classified as class X that actually belong to class X, divided by the total number of cases that the classifier classifies as belonging to class X. In a two-class classification problem (like in our example), precision is identical to the positive predictive value.

Recall: the number of cases belonging to class X that were correctly classified as class X, divided by the total number of class X cases. In a two-class setting it is identical to sensitivity.

Precision and recall can be applied to more-than-two-class problems, which explains their widespread use in, e.g., the reporting of the performance of machine learning algorithms. The F1 score is the harmonic mean of precision and recall:

$F1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$

If either the precision or the recall has a small value, the overall F1 score will be small. It thus provides a more sensitive measure than accuracy.
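The values quoted above can be reproduced directly from the cell counts of the confusion matrix in Table 1; the short Python sketch below is only a convenience for checking the arithmetic.

```python
# Cell counts from the confusion matrix in Table 1 (class 'disease' is positive).
TP, FN = 23, 12     # disease cases: correctly detected / missed
TN, FP = 116, 5     # healthy cases: correctly classified / false alarms

accuracy    = (TP + TN) / (TP + TN + FP + FN)        # 0.891
sensitivity = TP / (TP + FN)                         # recall for the disease class, 0.657
specificity = TN / (TN + FP)                         # 0.959
ppv         = TP / (TP + FP)                         # precision for the disease class, 0.821
npv         = TN / (TN + FN)                         # 0.906
f1          = 2 * ppv * sensitivity / (ppv + sensitivity)

print(accuracy, sensitivity, specificity, ppv, npv, f1)
```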
A word of caution regarding PPV and NPV: their use is common, partly because of the intuitiveness of the measures (if my classification algorithm says I have a disease, how much should I believe it). However, it should be kept in mind that PPV and NPV are not only dependent on the performance of the classifier per se, but also on how many cases of the different classes are present in the dataset. This relates to the prevalences of the different classes, or prior probabilities, and to the earlier example regarding breast cancer screening (where we ended up with a surprisingly low PPV of 7.8%). It can be further illustrated with a simple example, see Tables 3 and 4. The NPV and PPV are influenced by the ratio of disease and healthy cases that happen to be in the test set. If the number of disease cases is high, then the PPV also tends to be high. This is intuitively understandable: if a disease's prevalence is high, it is easier to believe the classifier when it classifies a case as disease than it would be if the disease were rarely occurring. Thus the PPV and NPV are influenced by both the classifier performance and the number of cases of the different classes in the test set. They should therefore not be used to compare classifiers' performances when those performances have been derived from different datasets. Sensitivity and specificity do not suffer from this problem. Different prevalences can (and are likely to) occur when we deal with datasets that have been collected at different centres, in different geographical locations, with different processes, etc. Another, more technical reason why prevalences may be affected is when training sets are artificially balanced because a certain class is underrepresented. Training a classifier on a class that has only a few instances may be difficult, and one way to deal with that is by repeating/copying the few rare disease cases in the set to ease training. Thus we are artificially increasing the prevalence and, as a consequence, the PPV.

The examples above relate to two-class problems, but in reality we often have more than two classes, for example, when classifying different patient states/disorders in an ICU, or in differential diagnostics. We already saw that precision and recall are naturally defined for any number of classes. Sensitivity and specificity can be generalized to a multi-class problem by grouping classes: a two-class setting is generated by classifying one class vs. all other classes together.

As we saw, the accuracy measure does not give good insight into the overall usefulness of a classifier. Sensitivity, specificity, PPV, and NPV provide better insights. If one wants to summarise the performance in one single number, the balanced accuracy can be used as an alternative to the regular accuracy. For a two-class classifier, it can be calculated as the average of sensitivity and specificity. For the general case it is the average of the proportions of correctly classified cases for each class individually. If the classes are balanced, e.g., there are as many cases in the disease group as there are in the healthy group, then the balanced and regular accuracy are equal. However, in cases where the numbers of cases in the different classes are not the same (which is very common in healthcare settings), balanced accuracy gives a more appropriate estimation of overall accuracy.
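The difference between accuracy and balanced accuracy is easy to demonstrate on imbalanced data, for instance with scikit-learn's metric functions; the always-healthy classifier below is the same toy example as before and is purely illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Imbalanced test set: 999 healthy (0) cases and 1 disease (1) case.
y_true = np.array([0] * 999 + [1])
y_pred = np.zeros_like(y_true)        # a classifier that always predicts 'healthy'

print(accuracy_score(y_true, y_pred))           # 0.999, looks impressive
print(balanced_accuracy_score(y_true, y_pred))  # 0.5, reveals the useless classifier
```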
In the two-class classification problem there is often a trade-off between having a high sensitivity (detect all persons who have a disease) versus a high specificity (avoid false alarms, detect all persons who are healthy). Usually the classification is done by evaluating a classifier output and using a threshold on it (or 'decision criterion') to decide whether to assign the input data to class 0 or class 1. The value of the threshold then defines what the values of sensitivity and specificity are. An illustration of the problem is given in Figure 1. If we move the threshold (criterion value in Fig. 1) to low values (and classify all persons who have a classifier output higher than that low value as having a disease), all persons with a disease would be detected (sensitivity = 1). However, many healthy persons would also be classified as having a disease (false alarms, specificity is low). The opposite is also true: a high threshold leads to high specificity but low sensitivity. Selecting the best threshold is thus a trade-off between sensitivity and specificity.

The problem can be visualized with so-called ROC (Receiver Operating Characteristic) curves, which have been around since WWII but were introduced into the medical field in the 1970s. They are a plot of 1-specificity on the x-axis vs. sensitivity on the y-axis (in some research disciplines, these axes are labelled FPR (false positive rate) on the x-axis and TPR (true positive rate) on the y-axis). An example of an ROC plot is given in Figure 2. It can be seen that any threshold value below 0.72 has a sensitivity < 0.6 and any value above 0.8 has a specificity < 0.6.

Figure 2: An example of an ROC curve, in this case of a classifier (Monophasic Linear Detector) trained to classify brain activity: monophasic EEG vs. normal EEG. The numbers along the curve are different thresholds/criterion values for making the final classification. More information can be found in [3].

It can, for example, be seen that at certain places a relatively small increase in threshold (from 0.79 to 0.82) leads to a big effect on specificity (a drop from 0.78 to 0.4). This type of exploration helps to make informed decisions on the threshold settings. The point (0,1) would give the ideal classifier, with both sensitivity and specificity having a value of 1. The point on the curve closest to (0,1) would correspond to the best classifier, but it has to be kept in mind that usually there is a preference for either the sensitivity or the specificity (or both) to be in a certain region (sometimes called the 'clinically useful region'), and either of them may be given relatively more importance. Thus, the curve provides a tool to explore the merits of different thresholds.

Another common use of the ROC curve is to calculate its area under the curve (AUC) and use that as a performance measure for the classifier. It is a number between 0 and 1, with a value of 1 indicating that the classifier will always classify a randomly presented case correctly. A value of 0.5 indicates that the classifier is no better than random guessing (and traditionally the diagonal is plotted in the figure as well, indicating a random classifier for comparison). A useful classifier should have an AUC (significantly) higher than 0.5. How high an AUC should be for it to be good enough is application-dependent. For some situations an AUC of 0.7 might already be very good, for others 0.95 might still be rather poor. And, in some cases a lower AUC might be clinically acceptable if, e.g., the sensitivity is high. The AUC is convenient because it provides one single number to describe overall performance and, as such, can be used to compare different classifiers against each other. However, it should be kept in mind that it says nothing about clinical usefulness. A good recent overview of the discussion can be found in [4], where it is also described how the ROC curve can in fact be derived from the pdfs of the classifier outputs for the different classes.
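In practice, ROC curves and AUC values are rarely computed by hand; the sketch below shows one way to obtain them with scikit-learn, using simulated classifier outputs as a stand-in for real test-set scores.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
# Simulated continuous classifier outputs: higher scores for the disease class (1).
y_true = np.array([0] * 100 + [1] * 100)
scores = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(1.5, 1.0, 100)])

fpr, tpr, thresholds = roc_curve(y_true, scores)   # fpr = 1 - specificity, tpr = sensitivity
print("AUC:", roc_auc_score(y_true, scores))
# Each (fpr[i], tpr[i]) pair corresponds to the decision threshold thresholds[i].
```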
The ROC curve may give us a tool to find optimal threshold values in theory: a point on the curve as close as possible to (0,1) (equal to 1-specificity = 0, sensitivity = 1). However, there are many trade-offs to make when considering the balance between sensitivity and specificity. For example: the relative costs associated with acting upon a classified disease (when it is a false classification), or the costs associated with not acting when the disease is actually present. Or the discomfort to a patient associated with a possible intervention (or the discomfort when the disease is not treated), etc. Differences in costs for false positives vs. false negatives may give rise to a reconsideration of the position of the optimal threshold. Additionally, the prevalence of the disease plays a role. Good practical discussions of ROC analysis in healthcare settings can be found online. There are three approaches to finding 'optimal' thresholds (a code sketch of the first two follows below):

• Calculate the point on the ROC curve that has minimum distance to (0,1). This assumes that sensitivity and specificity are of equal importance. It is easy to implement in an algorithm: calculate the distance for each point on the curve, and choose as the optimal threshold the point with the smallest distance.

• The second approach uses the logic that the point on the ROC curve that is at the largest vertical distance from the diagonal represents the optimal threshold. Informally, this could be motivated by saying that, since points on the diagonal represent a 'random classifier', points far away from it represent better classifiers; the further the better. If we consider, for a given x-coordinate (1-specificity), the y-coordinate of the point on the ROC curve (the sensitivity) and the y-coordinate of the diagonal (which equals 1-specificity), then the vertical distance is sensitivity - (1 - specificity) = sensitivity + specificity - 1. This is called the Youden index, J. Optimizing this value thus gives a threshold with the best combination of sensitivity and specificity (or a maximum 'balanced accuracy'). Alternatively, it can be seen as maximising the difference between the sensitivity and the false positive rate. Again, we assume equal importance for both.

• Finally, estimating the optimal threshold based on costs (cost minimisation) would deliver the value that can be expected to yield the highest benefit in the real world. This does not assume that sensitivity and specificity are equally important. The issue is that the costs associated with misclassifications are highly diverse and difficult to estimate, as they originate from external processes (e.g., treatment processes), personal patient circumstances (co-morbidities, social interactions, occupation), or local considerations (e.g., costs of tests, reimbursement policies). Costs are either direct or indirect and may be incurred at different time scales. Thus, there is no easy recipe as there was for the first two approaches, and the effort needs to be seen more as part of wider cost-effectiveness and impact-effectiveness studies, which are major disciplines by themselves.

Intuitively, it can be understood that, if the cost of missing a disease diagnosis is high, and intervention (even an unneeded intervention for a person who is in reality healthy) is safe and cheap, then the best thresholds are found in the top-right area of the curve: high sensitivity, accepting a high number of false positives. On the other hand, if an intervention carries high risk and we are not convinced of its effectiveness, the threshold will be in the bottom-left corner: we minimise harming non-diseased people, but take missing diseased persons for granted.
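The first two threshold-selection rules above are straightforward to implement from the ROC coordinates; the following sketch (with simulated scores, and a helper function name of our own choosing) illustrates both.

```python
import numpy as np
from sklearn.metrics import roc_curve

def optimal_thresholds(y_true, scores):
    """Return the thresholds chosen by the closest-to-(0,1) rule and by the Youden index."""
    fpr, tpr, thr = roc_curve(y_true, scores)
    # Rule 1: point on the ROC curve with minimum distance to (0, 1).
    dist = np.sqrt(fpr ** 2 + (1.0 - tpr) ** 2)
    # Rule 2: maximum vertical distance to the diagonal, J = sensitivity + specificity - 1.
    youden = tpr - fpr
    return thr[np.argmin(dist)], thr[np.argmax(youden)]

# Example with simulated scores; in practice y_true and scores come from a test set.
rng = np.random.default_rng(0)
y_true = np.array([0] * 100 + [1] * 100)
scores = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(1.5, 1.0, 100)])
print(optimal_thresholds(y_true, scores))
```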
In many healthcare settings, sensitivity is prioritised over specificity when it comes to detecting critical patient states. However, it is good to keep in mind that false positives are a major burden, e.g., in critical care patient monitoring, leading to 'alarm fatigue' and potentially enormous costs [5]; minimising false alarms is a main objective in many medical equipment R&D efforts. Another recent example can be found in the context of antibody testing for Covid-19: a false positive result may wrongly suggest that a person is 'safe' and can interact more freely with others, with potentially disastrous consequences.

It is worth mentioning several other performance measures that are commonly used.

• The Youden index (or Youden's J statistic) is defined as [6]

$J = \text{sensitivity} + \text{specificity} - 1.$

Expressed in this way, it is obviously equivalent to the balanced classification rate (or balanced accuracy). However, the Youden index is often used as the maximum potential effectiveness of a biomarker:

$J_{max} = \max_{c} \{\text{sensitivity}(c) + \text{specificity}(c) - 1\},$

where $c$ is a cut-off point [7]. It takes values between -1 and 1.

• The Dice index [9] is widely used in the evaluation of image segmentation algorithms as it effectively ignores the correct classification of negative samples (the background region). The Dice index is defined as $2TP/(2TP+FP+FN)$. It has close connections to Cohen's Kappa [10] and it is equivalent to the Jaccard coefficient (sometimes termed the Tanimoto coefficient) [11].

In this subsection, we briefly outline the most important performance measures when facing a regression task, that is, when the variable to be predicted is real-valued instead of categorical. The majority of machine learning algorithms for regression problems aim to minimize the mean squared error (MSE):

$MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2,$

where $y_i$ is the correct value for target $i$ and $\hat{y}_i$ is our prediction of it. It puts more emphasis on bigger errors than on smaller ones (which makes sense in many real-life applications), it treats positive and negative errors equally (also acceptable in many cases), and the square function is mathematically convenient for many optimisation algorithms. A related measure is the mean absolute error (MAE):

$MAE = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|,$

which is less 'punishing' to large errors. MSE and MAE report the error in the scale and quantity of the original variables, which makes them sometimes hard to interpret and application-dependent. Easier-to-interpret alternatives are the (Pearson) correlation coefficient and the coefficient of determination (sometimes termed the normalized MSE), the latter being defined as

$Q^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2},$

where $\bar{y}$ is the mean of the correct target values. Note that, in contrast to explanatory modeling, in predictive modeling the coefficient of determination can take negative values and it is not equal to the correlation squared [12]. This is why the notation $Q^2$ is recommended instead of $R^2$.
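The regression measures above are available in scikit-learn, where the coefficient of determination is reported by r2_score and can indeed become negative for poor predictive models; a brief sketch with simulated targets follows.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Simulated targets and predictions, for illustration only.
rng = np.random.default_rng(0)
y_true = rng.normal(50.0, 10.0, 200)
y_pred = y_true + rng.normal(0.0, 5.0, 200)      # predictions with some error

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
q2  = r2_score(y_true, y_pred)                   # 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
r   = np.corrcoef(y_true, y_pred)[0, 1]          # Pearson correlation coefficient

print(mse, mae, q2, r)
```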
The performance measures described above (sensitivity, specificity, AUC of the ROC, etc.) all give particular values based on the specific data and algorithm that have been used. How accurate such a value is as an estimate of the performance in the general population is an important question. It may intuitively be expected that a sensitivity of 0.9 as calculated from a dataset with 10,000 cases is more accurate than one calculated from 10 cases. Also, how do we compare classifiers and test, e.g., whether algorithm A is 'significantly better' than algorithm B based on the AUC?

To estimate the standard error in sensitivity and specificity, different approaches exist. They all relate to the calculation of the confidence interval of a binomial proportion. The simplest implementation is the asymptotic approach, which holds if the number of samples is very large. Other, more refined versions use the Wilson score interval or the Clopper-Pearson interval.

The confidence interval for the AUC is not trivial to calculate, as it requires assumptions about the underlying distributions. Just calculating the mean and standard deviation from a number of pooled AUC observations is not appropriate, as the distribution of the ROC is not inherently normal and the 'samples' (AUC observations) are not independent, since the underlying data remains constant. An authoritative paper from 1982 by Hanley and McNeil [13] gives estimates that are relatively conservative and based on assumptions of exponentiality of the underlying score distributions. An alternative is to determine instead the maximum of the variance over all possible continuous underlying distributions with the same expected value of the AUC, which gives an impractically loose estimate. Cortes and Mohri [14] present an approach that is distribution-independent and thus more widely applicable. A convenient calculator for the confidence intervals of various measures can be found at the "Diagnostic test evaluation calculator" webpage.
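As an example of the binomial-proportion intervals mentioned above, a Wilson score interval for the sensitivity of the Table 1 classifier (23 detected disease cases out of 35) can be obtained with the statsmodels package; this is only a sketch, and the same function also provides Clopper-Pearson and asymptotic intervals via its method argument.

```python
from statsmodels.stats.proportion import proportion_confint

# Sensitivity from Table 1: 23 detected disease cases out of 35.
tp, n_disease = 23, 35
low, high = proportion_confint(count=tp, nobs=n_disease, alpha=0.05, method="wilson")
print(f"sensitivity = {tp / n_disease:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```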
There can be many reasons why the performance measures are not 'exact'. The abovementioned confidence intervals are based on taking into account natural random variation in the observations, but the variation in the observations can have many underlying reasons and may not be random at all. Some reasons why we cannot assess performance measures exactly include:

• Lack of gold standards: in the above discussions we implied that we knew to which class (0 or 1) a person belonged, and what the algorithm's 'correct' answer was supposed to be. However, in many cases the situation is not that clear-cut. A 100% final diagnosis for a form of dementia might be available only once pathology research can be performed after the patient has deceased. Thus, if we classify that person's data while she is alive, there is a chance that the 'correct reference' is not actually correct, which influences our performance estimates. For many patient states (e.g., awareness, anesthesia, pain) we have scales that are not absolute but have been accepted as 'good enough' for practical use. All classification estimates on such scales are thus inherently fraught with some uncertainty margin, because we simply do not know the exact right answer.

• Inter-expert variability: to develop and train algorithms, data needs to be labelled and assigned to classes. This is typically done by experts in the field. In many of the more complex diagnostic tasks there is room for interpretation: expert A might come to a different diagnostic conclusion than expert B (based on earlier experience, processes, etc.). In that case the question arises whether we should develop a classifier that matches expert A as well as possible, or expert B (or C), or take the average of both. For many datasets the reference labels have been created by having the different experts discuss with each other and come to a compromise labelling that all agree with. Another approach is to quantify the agreement level between the expert opinions and use that as the performance target for the classifier.

• Limited representativeness of the development and test data: if data has been collected in one hospital only, and is being tested on completely independent data from the same hospital, it may be that the performance when applied to data from other hospitals is disappointing. Different settings have different practices, different patient populations (with perhaps different disease prevalences), different types of equipment, and different staff; this all may lead to drastic changes in the performance measures. Thus, it is essential in many applications to use multi-centre studies involving different hospitals from different countries, to make sure that the performance assessment results are as generally applicable as possible. This, obviously, is an expensive endeavour and a major reason why the uptake of new technologies in clinical practice is slow.

In the previous section, we outlined various performance measures. In this section, we describe means to compute the performance measures in practice and discuss potential pitfalls that may arise. We will concentrate on non-parametric estimation principles (validation, cross-validation, bootstrap) that are equally applicable to all the performance measures introduced in the previous section. Hence, for clarity, we will focus on the estimation of the classification error (or, equally, the accuracy (Eq. (5))), noting that in most cases it can be replaced by any performance measure. Many fields of biomedicine have published their own guidelines on how to evaluate machine learning algorithms, for example in radiology [15, 16, 17, 18], and practitioners should be aware of the field-specific guidelines [19]. The TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) Statement includes a 22-item checklist, which aims to improve the reporting of studies developing, validating, or updating a prediction model [20]. At the same time, it needs to be understood that not all performance estimation tasks are equal: different validation principles may be appropriate when evaluating the potential of a new technology for use in biomedicine as opposed to a prototype of a product for clinical use. We recommend reading [21], which summarizes the validation aspects from a clinical viewpoint, making strong points for the public availability of predictive algorithms in healthcare.

Two error types can be distinguished in machine learning: the training error and the test error. The training error refers to the classification errors for the training samples, i.e., how large a portion of the training samples is misclassified by the designed classifier. The more important error type is the test error, which describes how large a portion of all possible objects is misclassified by the deployed algorithm/model. The theoretical Bayes classifier, introduced in Section 2, aims to minimize the test error when the class conditional probability densities and priors are known. However, these densities and priors are never known in practice. The training error is the frequency of error for the training data. The training error is an overly optimistic estimate of the test error. For example, the training error of the nearest-neighbour classifier is automatically zero. Obviously, this is not true for the test error. The estimation of the test error by the training error is termed the resubstitution method. Using the resubstitution method to estimate the classification error is a severe methodological mistake, sometimes called testing on the training data.

A much better estimate of the test error is obtained by dividing the training data into two disjoint sets, termed the training and test sets. The training set is used for the training of the classifier and the test set is solely used to estimate the test error. This method of estimating the true test error is called the holdout method.
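The optimism of the resubstitution estimate is easy to demonstrate: a one-nearest-neighbour classifier attains zero training error by construction, while a held-out test set gives a more realistic figure. A minimal sketch with simulated data (the dataset and split proportions are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
print("resubstitution accuracy:", clf.score(X_tr, y_tr))   # 1.0 for 1-NN
print("holdout accuracy:       ", clf.score(X_te, y_te))   # a less optimistic estimate
```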
If a natural division of the data into test and training sets does not exist, the division needs to be produced artificially. This artificial division needs to be randomized; for example, it is not a good idea to select all the examples of one class as the test set and the others as the training set. It is good practice to make the division stratified, so that there are equal proportions of classes in the training and test sets. Especially in the neural network literature, one encounters divisions of the available data into three sets, termed training, validation, and test. The validation set is then used to tune the parameters of the learning algorithm (learning rate, when to stop learning, etc.).

A hidden difficulty arises when we have several samples obtained from the same subjects, collected at different times (see reference [23] for an example of how to handle this kind of situation correctly). Then, all the samples obtained from the same subject need to be in either the training or the test set. It is inappropriate that some samples obtained from, say, subject J (two years ago) are in the training set while others obtained from J (a year ago) are in the test set. In other words, the training and test sets must be independent. If the two sets are not independent, then the estimates of the test error will be positively biased, and the magnitude of this bias can be surprisingly large.

Cross-validation is a resampling procedure for estimating the test error. It is a generalization of the holdout method. In k-fold cross-validation (CV), the training set $(X, Y)$ is split into $k$ smaller sets $(X_1, Y_1), \ldots, (X_k, Y_k)$ and the following procedure is followed for each of the $k$ folds (see Figure 3):
1. a machine learning model is trained using all the folds except the $i$th fold as training data;
2. the resulting model is tested on the $i$th fold $(X_i, Y_i)$.
The performance measure of k-fold CV is then the average of the values computed in the loop. In textbooks that are several decades old, k-fold CV has been viewed as computationally expensive, and therefore the holdout method (see Section 4.1) has been suggested for relatively modest sample sizes. However, computers are now much faster than in the eighties or nineties, and k-fold CV is not likely to be computationally prohibitive in current health and wellness applications. Likewise, there exists a wrong perception that holdout would be (theoretically) preferable to CV, but this is not true. Further, the distinction between cross-validation and holdout is not the same as the distinction between internal and external validation. In particular, the holdout method is not a surrogate for having an independent test set. The parameter $k$ is usually selected as 5 or 10 according to the suggestions given by [24]. However, in many cases there might be a natural division of the data into $k$ folds, for example, the data may have been collected in $k$ different medical centers, and then that natural division should be preferred. A special case, useful when the sample size is small, is leave-one-out CV (LOOCV), where each sample (or subject) forms its own fold, and thus $k$ is equal to the number of data samples. Finally, the remarks made about the independence of training and test sets in Section 4.1 hold also for k-fold CV: each fold should be independent.
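In scikit-learn, stratified folds and subject-wise (grouped) folds are both readily available; in the sketch below, the subjects array is a hypothetical per-sample subject identifier used to keep all samples of a subject in the same fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, GroupKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
# Hypothetical subject identifiers: here, two samples per subject.
subjects = np.repeat(np.arange(100), 2)

# Stratified 5-fold CV: each fold preserves the class proportions.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(GaussianNB(), X, y, cv=skf).mean())

# Grouped 5-fold CV: all samples of a subject end up in the same fold,
# keeping the training and test sets independent.
gkf = GroupKFold(n_splits=5)
print(cross_val_score(GaussianNB(), X, y, cv=gkf, groups=subjects).mean())
```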
Repeated CV addresses the inherent randomness of the fold selection by re-running a k-fold CV multiple times. There are different opinions on whether this is useful or not; in the opinion of the authors, rarely more than ten repeats are necessary. Note that different repeats of CV are not independent, so, for example, the variance of error estimates resulting from different CV runs (note the difference between runs and folds of a single run) is a useless quantity concerning the variance of the generalization error [22]. In stratified CV, the folds are stratified so that they contain approximately the same proportions of labels as the original data. For example, if 10% of the training data belongs to class 1 and 90% belongs to class 2, then each fold should also have approximately this 10/90 division. Stratification is typically highly recommended [24].

CV and holdout only yield meaningful results if they are used correctly. Common pitfalls include:

• Using the test labels (i.e., the correct classes of the test set) in feature selection and/or extraction. Selecting features for the classification using the whole data (i.e., not just the training set) leads to optimistically biased classifier performance estimates, as demonstrated in, e.g., [25] (see also the sketch after this list). However, there exist also more subtle variations of the same issue. For example, preprocessing the data with principal component analysis (PCA) based on all the subjects of the healthy class commits the same mistake, as it (inadvertently) uses the labels to select the data for PCA. More generally, if CV is used simultaneously to optimize the model parameters and to estimate the error (within the same CV), the error estimates will be optimistically biased, and a procedure called nested CV is necessary for parameter tuning [26, 27]; we will return to this issue in Section 4.3.3.

• Failing to recognize that CV-based error estimates have a large variance, especially when the number of samples is low. In this case, the classifier may appear very good (or bad) just because of chance. This issue, discovered over 40 years ago [28], has received a reasonable amount of attention recently [29, 27, 30, 31, 32, 33]; however, its effects are still often underestimated.

• Selecting folds in a way that the training and test sets are not independent, for example, when more than one sample exists from the same subject, as we discussed in Section 4.1.
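The first pitfall can be avoided by keeping feature selection inside the CV loop, for example with a scikit-learn Pipeline; the sketch below contrasts the leaky and the correct procedure on pure-noise data, for which the true accuracy is 50%.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1000))          # pure noise: no real class information
y = rng.integers(0, 2, size=50)

# WRONG: feature selection uses the labels of the whole dataset before CV.
X_leaky = SelectKBest(f_classif, k=10).fit_transform(X, y)
print(cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean())  # optimistic

# CORRECT: the selection is refit inside each CV training fold via a Pipeline.
pipe = Pipeline([("select", SelectKBest(f_classif, k=10)),
                 ("clf", LogisticRegression(max_iter=1000))])
print(cross_val_score(pipe, X, y, cv=5).mean())                                     # close to 0.5
```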
Cross-validation or an independent test set: is there a difference?

Many have voiced (e.g., [15]) the requirement for independent test sets for the evaluation of machine learning algorithms in health and life science applications. Especially if the clinical applicability of a trained machine learning model for a particular task needs to be evaluated, this is absolutely mandatory. However, we stress that the test set 1) needs to be truly independent (preferably not existing at the training time; see, e.g., a recent competition on Alzheimer's disease prediction for a good example [34, 35]), and 2) needs to model the actual task as well as possible (i.e., collecting the test set at hospital A when the actual method is to be used at a different hospital B may not be optimal). Breaking an already existing dataset into training and test sets as in the holdout method is rarely a good idea if the dataset is not large (in terms of the number of subjects); it is then better to use cross-validation. What "large" means depends on the task at hand, but to give a general guideline: 1) datasets of over 10,000 subjects can be divided into training and test sets; 2) for datasets of 1,000 to 10,000 subjects, this depends on the case at hand; 3) for datasets under 1,000 subjects, cross-validation is typically better. Also, if the data has been collected at several hospitals, it is relevant to perform leave-one-hospital-out cross-validation and report the errors for all the hospitals, instead of selecting some hospitals as training and some as test sets. It is good to keep in mind that cross-validation approximates the performance of the classifier trained with all the available data. The classifiers derived from using different folds as the training set will differ. Thus, if a specific classifier needs to be evaluated, there is no alternative to the collection of a large test set [21].

Voicing the requirement for independent test sets for the evaluation of machine learning algorithms in health and life science applications has brought with it the confusion that holdout would be preferable to CV-based error estimation. However, in the absence of a truly separate test set, CV always leads to better estimates of the predictive accuracy than the holdout, as is demonstrated by a simulation in Figure 4, where CV-based error estimates have much smaller mean absolute errors than holdout-based ones. This is because, in CV, the results of multiple runs of model testing (with mutually independent test sets) are averaged together, while the holdout method involves a single run (a single test set). The holdout method should be used with caution, as its estimate of the predictive accuracy tends to be less stable than that of CV. In the simulation depicted in Figure 4, both CV- and holdout-based error estimates are almost unbiased.

Figure 4: CV versus holdout with simulated data. The plot shows the mean absolute error in the classification accuracy estimation using CV and holdout with respect to a large, external test set consisting of one million samples. The plot shows that the classification accuracy estimate by 5-fold CV is always over two times better than the classification accuracy estimate using holdout with 20% of the training data in the test set. The number of features was varied (d = 1, 3, 5, 9), but the classification task was tuned in a way that the Bayes error was always 5%. When the task was tuned this way, the number of features had very little effect on the accuracy of the estimated error. The classes were assumed to be Gaussian distributed with equal diagonal covariances and the two classes were equally probable. The classifier was the GNB introduced in Section 2. We trained GNBs with 1000 different training sets of the specific size to produce the figure. The code reproducing this experiment can be retrieved at https://github.com/jussitohka/ML_evaluation_tutorial

Modern ML algorithms often come with various hyper-parameters to tune (for example, the parameter C in support vector machines, the mtry parameter in Random Forests, or when to stop training in neural networks). If using holdout, the recommendation is to divide the data into three non-overlapping sets: 1) a training set to train the classifiers with various hyperparameter values, 2) a validation set to decide which of the trained classifiers to use, and 3) a test set to test the accuracy of the selected classifier.
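One way to implement this three-way division is with two successive stratified splits; the proportions and the tuned hyperparameter below are arbitrary choices made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# First split off the final test set, then split the remainder into training and validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2,
                                                stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25,
                                                  stratify=y_tmp, random_state=0)

# Tune a hyperparameter (here C of an SVM) using the validation set ...
best_C = max([0.1, 1.0, 10.0],
             key=lambda C: SVC(C=C).fit(X_train, y_train).score(X_val, y_val))
# ... and report the accuracy of the selected model on the untouched test set.
final = SVC(C=best_C).fit(X_train, y_train)
print("selected C:", best_C, "test accuracy:", final.score(X_test, y_test))
```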
As we have already stated, omitting the test set and estimating the accuracy based on the best result on the validation set can lead to severely positively biased accuracy estimates. Likewise, when CV is used simultaneously for the selection of the best set of hyperparameters and for error estimation, a nested CV is required [26]. This is again because model selection without nested CV uses the same data to tune the model parameters and to evaluate the model performance, leading to upward biased performance estimates, as demonstrated in several works [26, 25]. There exist several variants, but the basic idea, as the name indicates, is to perform a CV loop inside a CV loop. First, the whole dataset is divided into $k$ folds, and one at a time is used to test the model trained with the remaining $k - 1$ folds, as in the ordinary CV above. However, a CV is performed within each set of $k - 1$ training folds to select the best hyperparameters, which are then applied to train a model on the (outer) training set. Pseudo-code for nested CV is presented in [25]. The high variance of the CV (and other non-parametric error estimators) also hinders the selection of hyperparameters for classification algorithms [27, 31] and, as a result, one should not be overly confident about the selected hyperparameters in small-sample settings.
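With scikit-learn, a nested CV can be written by placing a hyperparameter search (the inner CV) inside cross_val_score (the outer CV); the following is a minimal sketch with an arbitrary parameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Inner CV: selects the hyperparameter C on each outer training set.
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1.0, 10.0]}, cv=5)

# Outer CV: estimates the error of the whole procedure, tuning included.
nested_scores = cross_val_score(inner, X, y, cv=5)
print("nested CV accuracy: %.3f +/- %.3f" % (nested_scores.mean(), nested_scores.std()))
```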
4.6. Practical considerations
4.6.1. Dealing with small sample sizes
Machine learning experts are often asked about the minimum sample size needed to design a machine learning algorithm. This is a tricky question, as the answer depends on the application and on the required performance. Also, for specific applications such as image segmentation, a very small number of images may be sufficient because every pixel is a sample. In contrast, image classification may require many more images, as now each subject contributes only a single sample. In any case, the training sample must represent the population of all possible subjects well enough.
If the available training set is small in terms of the number of subjects, it is essential to use simple learning algorithms. Note that the term 'simplicity' here refers to the number of parameters the algorithm has to learn. For example, a GNB classifier learns 2d + 1 parameters, where d is the number of features. The nearest-neighbor classifier, albeit simple to implement, leads to much more complicated decision regions and more parameters to be learned [44]. GNB is typically a good choice if the number of samples is small. Also, as we have emphasized, it is a good idea to regard CV-based and, in particular, holdout-based error estimates critically if the number of samples is small. As shown in Figure 4, the mean absolute error of the holdout error estimate is larger than the Bayes error when the number of samples per class is 25, and, for the majority of iterations, the holdout reports zero error (while the Bayes error rate is 5%). To combat the small sample size problem, there are parametric error estimators that may succeed better than their non-parametric counterparts in small-sample scenarios [39].
Note that the effective sample size is the number of training samples in the smallest class, i.e., if we have 1000 examples of one class and 3 of another, the trained classifier is not likely to be accurate. If faced with highly imbalanced class divisions, a few considerations are necessary. First, as outlined in Section 3, it is essential to use an error measure that is appropriate for the problem. Second, if selecting hyperparameters of the algorithm, it is essential to use the same error measure to select the parameter values. Third, the use of stratification in the CV is absolutely necessary. Care should be taken when using oversampling techniques, such as SMOTE [45], as these do not actually increase the amount of training data.
There are several choices when combining the AUCs from the different test partitions. Two of the simplest are [46]:
• Pooling. The frequencies of true positives and false positives are averaged. In this way, one average, or group, ROC curve is produced from the pooled estimates of each point on the curve.
• Averaging. The AUC is calculated for each test partition, and these AUCs are then averaged.
More techniques for combining ROCs from several test partitions are introduced in [47].
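The following minimal sketch contrasts the two simple strategies, assuming scikit-learn, a GNB classifier, and hypothetical arrays X and y with binary labels; pooling is approximated here by concatenating the per-fold scores and labels and computing a single AUC over them.

# Minimal sketch of pooled versus averaged AUC over CV test partitions.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

def pooled_and_averaged_auc(X, y, n_splits=5, random_state=0):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    pooled_scores, pooled_labels, fold_aucs = [], [], []
    for train_idx, test_idx in skf.split(X, y):
        clf = GaussianNB().fit(X[train_idx], y[train_idx])
        scores = clf.predict_proba(X[test_idx])[:, 1]   # score for the positive class
        fold_aucs.append(roc_auc_score(y[test_idx], scores))
        pooled_scores.append(scores)
        pooled_labels.append(y[test_idx])
    # Pooling: one ROC/AUC from the concatenated scores of all test partitions.
    pooled_auc = roc_auc_score(np.concatenate(pooled_labels),
                               np.concatenate(pooled_scores))
    # Averaging: mean of the per-partition AUCs.
    return pooled_auc, np.mean(fold_aucs)

The two numbers are usually close, but they can differ noticeably when the score distributions vary between folds, which is one reason to state explicitly which combination strategy was used.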
The increasing research and development efforts using data-driven approaches have both positive and negative effects. On the positive side, new applications and solutions are delivered for problems that, ten years ago, were still considered prohibitively difficult to solve; think, e.g., of image recognition, speech recognition, natural language processing in general, and the highly successful advances in medical technology, especially in image analysis and assisted diagnostics. On the negative side, there is a hype situation in which there are unrealistic expectations with regard to techniques such as AI and machine learning. High expectations are set both by end-users and customers and by the scientific community itself.
This leads to a situation in which a thorough, objective assessment of performance comes under pressure. As we have seen, proper performance assessment is not trivial (it requires an understanding of the problem and the data), and it is often costly (data needs to be reserved) and time-consuming. It is often less exciting than the development and training work itself, and it has the unfortunate property that the performance estimates obtained after objective validation are often lower than the 'highly promising' results obtained in the early development phase. This means that the enthusiasm of colleagues, potential customers, investors, and end-users may decrease as a consequence of 'proper' testing. All in all, there is pressure to deliver fast and to publish (positive!) results as soon as possible.
The tendency to publish only results that 'improve upon the state-of-the-art (SotA)' is an increasing problem. It leads to situations where algorithms and parameter sets are tuned and re-tuned almost indefinitely towards as good a performance as possible: 'SotA-hacking'. This becomes highly problematic when the overall dataset is fixed, as is the case for publicly available databases or datasets used in pattern recognition competitions. No matter whether an appropriate CV scheme is used and an 'independent' test set is provided, the fact remains that an enormous amount of effort is dedicated to finding the optimal solution for one specific data set, and there is insufficient information about how the algorithm behaves on other data in real life. This is sometimes referred to as 'meta-training', as the researchers themselves and their environment are getting trained to optimise their work for a specific data set. The overall problem is wide and ultimately originates from inappropriate experimental design and hypothesis testing procedures, including so-called Hypothesizing After the Results are Known (HARKing) practices. The interested reader can find more information in [48].
This paper aimed to give practical information about how to assess the performance of machine learning approaches, paying special attention to biomedical applications. It is meant as a pragmatic set of guidelines and a collection of practical experiences gained in the field over several decades of research. We hope it allows the interested reader to take a critical and objective look at the subject and to understand the main issues underlying performance assessment. Finally, it is worthwhile to re-emphasise that good performance is only one of the items in the list of requirements for successful uptake of machine learning algorithms in practice: usability, seamless integration into existing processes and infrastructures, and explainability are all components of similar importance, and they have their own metrics.
References
[1] Effect of separate sampling on classification accuracy
[2] On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes
[3] Quantification of epileptiform electroencephalographic activity during sevoflurane mask induction
[4] Reflection on modern methods: Revisiting the area under the ROC curve
[5] Clinical alarm hazards: a top ten health technology safety concern
[6] Index for rating diagnostic tests
[7] Youden index and optimal cut-point estimated from observations affected by a lower limit of detection
[8] Comparison of the predicted and observed secondary structure of T4 phage lysozyme
[9] Measures of the amount of ecologic association between species
[10] Morphometric analysis of white matter lesions in MR images: method and validation
[11] Magnetic resonance image tissue classification using a partial volume model
[12] Predicting symptom severity in autism spectrum disorder based on cortical thickness measures in agglomerative data
[13] The meaning and use of the area under a receiver operating characteristic (ROC) curve
[14] Confidence intervals for the area under the ROC curve
[15] Assessing radiology research on artificial intelligence: A brief guide for authors, reviewers, and readers - from the Radiology editorial board
[16] French radiology community white paper, Diagnostic and Interventional Imaging
[17] Design characteristics of studies reporting the performance of artificial intelligence algorithms for diagnostic analysis of medical images: results from recently published papers
[18] Methodologic guide for evaluating clinical performance and effect of artificial intelligence technology for medical diagnosis and prediction
[19] Regulatory approval versus clinical validation of artificial intelligence diagnostic tools
[20] Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): explanation and elaboration
[21] Predictive analytics in health care: how can we know it works?
[22] Inference for the generalization error
[23] T1 white/gray contrast as a predictor of chronological age, and an index of cognitive performance
[24] A study of cross-validation and bootstrap for accuracy estimation and model selection
[25] MEG mind reading: strategies for feature selection
[26] Selection bias in gene extraction on the basis of microarray gene-expression data
[27] On over-fitting in model selection and subsequent selection bias in performance evaluation
[28] Additive estimators for probabilities of correct classification
[29] Is cross-validation valid for small-sample microarray classification?
[30] Performance of feature-selection methods in the classification of high-dimension data
[31] Model selection for linear classifiers using Bayesian error estimation
[32] On the dangers of cross-validation. An experimental evaluation
[33] Cross-validation failure: small sample sizes lead to large error bars
[34] TADPOLE challenge: Prediction of longitudinal evolution in Alzheimer's disease
[35] The Alzheimer's disease prediction of longitudinal evolution (TADPOLE) challenge: Results after 1 year follow-up
[36] Approximate statistical tests for comparing supervised classification learning algorithms
[37] Evaluating the replicability of significance tests for comparing learning algorithms
[38] Bolstered error estimation
[39] Bayesian minimum mean-square error estimation for classification error - Part I: Definition and the Bayesian MMSE error estimator for discrete classification
[40] Improvements on cross-validation: the .632+ bootstrap method
[41] An introduction to the bootstrap
[42] Random forests
[43] Out-of-bag estimation
[44] The Elements of Statistical Learning
[45] SMOTE: synthetic minority over-sampling technique
[46] The use of the area under the ROC curve in the evaluation of machine learning algorithms
[47] An introduction to ROC analysis
[48] HARK side of deep learning - from grad student descent to automated machine learning
Acknowledgements
J. Tohka's work has been supported in part by grants 316258 from the Academy of Finland and S21770 from the European Social Fund. We thank Vandad Imani, University of Eastern Finland, for drafting Figures 3 and 5.