key: cord-0036105-zr0bo9el
authors: Pfannschmidt, Karlson; Hüllermeier, Eyke; Held, Susanne; Neiger, Reto
title: Evaluating Tests in Medical Diagnosis: Combining Machine Learning with Game-Theoretical Concepts
date: 2016-05-10
journal: Information Processing and Management of Uncertainty in Knowledge-Based Systems
DOI: 10.1007/978-3-319-40596-4_38
sha: 77e23cfd444a7be96166e97272909ae1f08c5e5b
doc_id: 36105
cord_uid: zr0bo9el

In medical diagnosis, information about the health state of a patient can often be obtained through different tests, which may be combined into an overall decision rule. Practically, this leads to several important questions. For example, which test or which subset of tests should be selected, taking into account the effectiveness of individual tests, synergies and redundancies between them, as well as their cost? And how can an optimal decision rule be produced on the basis of the given data, which typically consists of test results for patients with or without confirmed health condition? To address questions of this kind, we develop an approach that combines (semi-supervised) machine learning methodology with concepts from (cooperative) game theory. Roughly speaking, while the former is responsible for optimally combining single tests into decision rules, the latter is used to judge the influence and importance of individual tests as well as the interaction between them. Our approach is motivated and illustrated by a concrete case study in veterinary medicine, namely the diagnosis of a disease in cats called feline infectious peritonitis.

Different types of tests, such as measuring serum antibody concentrations, are commonly used in medical diagnostics in order to reveal the health condition of an individual. The effectiveness of a single test is typically determined by correlating the test outcome with the true condition.
Moreover, classical statistical hypothesis testing can be used to compare different test procedures in terms of their effectiveness. In this paper, we tackle the problem of evaluating or selecting a test procedure from a slightly different perspective using methods of (semi-)supervised machine learning. Roughly speaking, the idea is that, by learning a model in which various candidate tests play the role of predictor variables, information about the usefulness of individual tests as well as their combination is provided by properties of that model. An approach of that kind has at least two important advantages:

- First, it not only allows for judging the usefulness of single tests but also of combined tests, i.e., the combination of different tests into one overall (diagnostic) decision rule. Thus, it informs about possible synergies (as well as redundancies) between individual tests and the potential to improve diagnostic accuracy through a suitable combination of these tests.
- Second, going beyond the standard setting of supervised learning, a machine learning approach suggests various ways of improving the selection of tests by taking advantage of additional sources of information. An important special case is the use of semi-supervised learning to exploit "unlabeled" data coming from individuals for which tests have been made but the true health condition is unknown. This situation is highly relevant in medical practice, because tests can often be conducted quite easily, whereas determining the true health condition is very difficult or expensive.

Our approach is motivated by a concrete case study in veterinary medicine, namely the diagnosis of a disease in cats called feline infectious peritonitis (FIP). Complete certainty about whether or not a cat is FIP-positive, and will eventually die from the disease, requires a necropsy [1, 10]; unfortunately, no test performed on a cat while still alive has 100 % sensitivity or 100 % specificity.
Consequently, while different tests can be applied to cats quite easily, "labeling" a cat in the sense of supervised learning is expensive, difficult and time-consuming.

In addition to the use of (semi-supervised) machine learning methodology in medical diagnosis, we propose a game-theoretical approach for measuring the usefulness of individual tests as well as model-based combinations of such tests. Roughly speaking, the idea is to consider a combination of tests as a "coalition" in the sense of cooperative game theory, and the "payoff" of the coalition as the diagnostic accuracy achieved by the test combination. This approach will be detailed in the next section, prior to elaborating more closely on our case study in Sect. 3, presenting experimental results in Sect. 4 and concluding the paper in Sect. 5.

Suppose a set of tests X_1, ..., X_K to be available. We consider the outcome of each test as a random variable X_k : Ω → R, where Ω is the population of individuals to which the test can be applied. Jointly, the K tests thus define a random vector X = (X_1, ..., X_K). The health state is a dichotomous variable Y ∈ Y = {−1, +1}. Typically, each test is a positive indicator in the sense that P(Y = +1 | X_k) increases with X_k, i.e., the larger X_k, the larger the probability of the positive class. Using machine learning terminology, each test corresponds to a feature or predictor variable. Moreover, X is the instance space, each x ∈ X is an instance, and Y is the (binary) output or response variable.

If a diagnostic decision ŷ ∈ {−1, +1} is not necessarily based on a single test X_k alone, but possibly uses a combination of several tests, a first question concerns the way in which such a combination is realized. From a machine learning point of view, this question is related to the choice of an underlying model class (hypothesis space) H_J of decision rules combining J tests, where J ≤ K is the number of tests included in the decision rule.
Formally, we specify a combined test in terms of the subset

A = {σ(1), ..., σ(J)} ⊆ [K] = {1, ..., K}

of tests it includes. The model class H could be defined, for example, as the class of linear threshold functions of the form

h(x) = ⟦ w_1 x_σ(1) + ... + w_J x_σ(J) ≥ t ⟧, (1)

where w_1, ..., w_J, t ∈ R+ and ⟦·⟧ maps true predicates to +1 and false predicates to −1; moreover, σ(j) is the j-th test included in the combination. Given a loss function L, let h*_A denote the Bayes predictor in H, i.e., a decision rule that minimizes the loss in expectation. We denote the expected loss of this model, which corresponds to the Bayes predictor in H_{|A|}, by

e*(A) = ∫ L(y, h*_A(x)) dP(x, y). (2)

In practice, of course, neither the Bayes predictor h*_A nor the ideal generalization performance e*(A) is known. Instead, we only assume a data set D = D_L ∪ D_U to be given, which consists of a set of labeled instances

D_L = {(x_i, y_i)}_{i=1}^{L} ⊂ X × Y

and possibly another set of unlabeled instances (test results without ground truth)

D_U = {x_j}_{j=1}^{U} ⊂ X.

From a machine learning point of view, it is then natural to estimate the generalization performance on the basis of D for each A ⊆ [K]. To this end, models (1) can be fitted and their generalization performance can be estimated, for example, using cross-validation techniques or the bootstrap. More specifically, what can be estimated in this way is the generalization performance of a model that is trained on a combination A and data in the form of L labeled and U unlabeled examples. Therefore, we shall denote a corresponding estimate by ê(A, L, U) or simply ê(A) (assuming the underlying data to be given).

Needless to say, the estimates ê(A) thus obtained are not necessarily monotone in the sense that ê(B) ≤ ê(A) for A ⊆ B. In fact, while e*(A) is the generalization performance of the Bayes predictor, i.e., the model that is obtained in the limit of an infinite sample size (provided the underlying learner is consistent), the estimates ê(A) are obtained from models trained on a finite (and possibly small) data set.
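To make the linear threshold combination (1) concrete, the following is a minimal sketch in Python; the weights, the threshold, and the test outcomes are purely illustrative and not fitted to the study data:

```python
# Minimal sketch of a combined test of the form (1): a linear threshold
# rule over a selected subset of tests.  All numbers are illustrative.

def combined_test(x, sigma, w, t):
    """Predict +1 if the weighted sum of the selected test outcomes
    reaches the threshold t, and -1 otherwise."""
    score = sum(w_j * x[j] for w_j, j in zip(w, sigma))
    return 1 if score >= t else -1

# Hypothetical outcomes of K = 4 tests for one patient (0-based indices):
x = [0.8, 0.1, 1.4, 0.0]

# Combine tests X_1 and X_3, i.e., sigma = (0, 2) in 0-based indexing:
print(combined_test(x, sigma=(0, 2), w=(0.5, 1.0), t=1.5))  # -> 1
```

Fitting the weights w and the threshold t to data is the job of the learner; here they are fixed by hand only to show the decision rule itself.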
Therefore, practical problems such as overfitting become an issue, i.e., including additional tests may deteriorate instead of improve generalization performance.

How can the ideal generalization performances

e*(A), A ⊆ [K], (3)

be estimated? Starting with the finite-sample estimates

ê(A), A ⊆ [K], (4)

our proposal is to correct these estimates so as to assure monotonicity. In fact, monotonicity is the main difference between the ideal and finite-sample scores. Apart from that, the ideal scores (3) should not differ too much from the estimates (4), i.e., e*(A) ≈ ê(A), at least if the training data is not too small. These considerations suggest the following estimation principle: Find a set of values (3) that satisfy monotonicity while remaining as close as possible to the corresponding scores (4). This principle can be formalized as an optimization problem of the following kind:

min Σ_{A ⊆ [K]} ( e*(A) − ê(A) )²  subject to  e*(B) ≤ e*(A) for all A ⊆ B.

The above problem can be tackled by means of methods for isotonic regression. More specifically, since the inclusion relation on subsets induces a partial order on 2^[K], methods for isotonic regression on partially ordered structures are needed [3, 14].

Consider the set function ν : 2^[K] → [0, 1] defined by ν(A) = 1 − e*(A). Obviously, ν is a monotone measure (of the usefulness of combined tests). Moreover, this measure can be normalized by setting

ν*(A) = ( ν(A) − ν(∅) ) / ( ν([K]) − ν(∅) ),

where ν(∅) is the performance of the best (default) decision rule that does not use any test, i.e., which either always predicts ŷ = +1 or always ŷ = −1. The measure ν*(·) thus defined satisfies ν*(∅) = 0, ν*([K]) = 1, and ν*(A) ≤ ν*(B) for A ⊆ B. Thus, ν* is a normalized, monotone (but not necessarily additive) set function, referred to as fuzzy measure or capacity in the literature [5]. For each combined test A, ν*(A) is a reasonable measure of the usefulness of this test.

In a similar way, a measure ν• can be defined on the basis of the finite-sample scores (4), that is, by normalizing ν(A) = 1 − ê(A) via

ν•(A) = ( ν(A) − ν_min ) / ( ν_max − ν_min ),

where ν_min = 1 − max_{B ⊆ [K]} ê(B) and ν_max = 1 − min_{B ⊆ [K]} ê(B). Note, however, that this measure is not necessarily monotone.
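The monotonicity constraint and the normalization to a capacity can be illustrated with a small self-contained sketch. Note the hedge: the paper solves a least-squares isotonic regression on the subset lattice [3, 14], whereas the function below only computes the smallest monotone upper envelope of the raw accuracy estimates, which is a simpler (and generally not least-squares-optimal) way of enforcing the constraint; the accuracy values are hypothetical.

```python
from itertools import combinations

def all_subsets(K):
    """All subsets of {0, ..., K-1} as frozensets."""
    for r in range(K + 1):
        for c in combinations(range(K), r):
            yield frozenset(c)

def monotone_envelope(acc):
    """Smallest monotone set function dominating the raw accuracy
    estimates acc[A] = 1 - e_hat(A): nu(A) = max over B subset of A
    of acc[B].  (The paper instead solves a least-squares isotonic
    regression on the subset lattice; this only illustrates the
    monotonicity constraint.)"""
    nu = {}
    for A in sorted(acc, key=len):          # process subsets bottom-up
        nu[A] = max([acc[A]] + [nu[A - {a}] for a in A])
    return nu

def normalize(nu, K):
    """Normalize a monotone measure to a capacity nu* with
    nu*(empty set) = 0 and nu*(full set) = 1."""
    lo, hi = nu[frozenset()], nu[frozenset(range(K))]
    return {A: (v - lo) / (hi - lo) for A, v in nu.items()}

# Hypothetical raw accuracies for K = 2 tests (non-monotone on purpose:
# adding test 1 to test 0 appears to hurt the finite-sample estimate):
acc = {frozenset(): 0.60, frozenset({0}): 0.80,
       frozenset({1}): 0.70, frozenset({0, 1}): 0.75}
nu_star = normalize(monotone_envelope(acc), 2)
print(nu_star[frozenset({0, 1})])  # -> 1.0
```

The bottom-up pass is valid because every subset B ⊆ A is reachable from A by removing one element at a time, so the running maximum propagates along the lattice.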
Which of the two measures is more meaningful, ν* or ν•? The answer to this question depends on practical considerations and what the measure is actually supposed to capture. When one is interested in the potential asymptotic usefulness of a test combination, ν* is the right measure. Otherwise, if a model induced from a concrete set of training data is supposed to be put into (medical) practice, ν• is arguably more relevant.

From the point of view of (cooperative) game theory, each (test) combination A ⊆ [K] can be seen as a coalition and ν ∈ {ν*, ν•} as the characteristic function, i.e., ν(A) is the payoff achieved by the coalition A. Thanks to this view, we can take advantage of various established game-theoretical concepts for analyzing the importance of individual players, which correspond to tests in our case, as well as the interaction between them. In particular, the Shapley value, also called importance index, is defined as follows [17]:

ϕ(k) = Σ_{A ⊆ [K]\{k}} ( |A|! (K − |A| − 1)! / K! ) ( ν(A ∪ {k}) − ν(A) ).

The Shapley value of ν is the vector ϕ(ν) = (ϕ(1), ..., ϕ(K)). For monotone measures (such as ν = ν*), one can show that 0 ≤ ϕ(k) ≤ 1 and Σ_{k=1}^{K} ϕ(k) = 1; thus, ϕ(k) is a measure of the relative importance of the test X_k.

The interaction index, as proposed by [13], is defined as follows:

I_{i,j} = Σ_{A ⊆ [K]\{i,j}} ( |A|! (K − |A| − 2)! / (K − 1)! ) ( ν(A ∪ {i, j}) − ν(A ∪ {i}) − ν(A ∪ {j}) + ν(A) ).

This index ranges between −1 and +1 and indicates a positive (negative) interaction between the tests X_i and X_j if I_{i,j} > 0 (I_{i,j} < 0).

It is worth mentioning that the approach put forward in this section is quite in line with the idea of Shapley value regression [11], which makes use of the Shapley value in order to quantify the contribution of predictor variables in (linear) regression analysis (quantifying the value of a set of variables in terms of the R² measure on the training data).

Feline infectious peritonitis (FIP) is a disease with an affinity to young cats and a predisposition for cats living in larger groups.
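Both indices are straightforward to compute by brute force for small K. The following self-contained sketch implements the two formulas above and sanity-checks them on a hypothetical additive capacity, for which the Shapley values recover the individual contributions and all pairwise interactions vanish:

```python
from itertools import combinations
from math import factorial

def shapley(nu, K):
    """Shapley values phi(k), k = 0..K-1, of a set function nu
    (a dict mapping frozensets of test indices to payoffs)."""
    phi = [0.0] * K
    for k in range(K):
        others = [i for i in range(K) if i != k]
        for r in range(K):
            for A in combinations(others, r):
                A = frozenset(A)
                w = factorial(r) * factorial(K - r - 1) / factorial(K)
                phi[k] += w * (nu[A | {k}] - nu[A])
    return phi

def interaction(nu, K, i, j):
    """Pairwise interaction index I_{i,j} of Murofushi and Soneda [13]."""
    others = [m for m in range(K) if m not in (i, j)]
    I = 0.0
    for r in range(K - 1):
        for A in combinations(others, r):
            A = frozenset(A)
            w = factorial(r) * factorial(K - r - 2) / factorial(K - 1)
            I += w * (nu[A | {i, j}] - nu[A | {i}] - nu[A | {j}] + nu[A])
    return I

# Sanity check with a hypothetical additive capacity: each test contributes
# its individual value, so phi recovers those values and interactions vanish.
nu = {frozenset(): 0.0, frozenset({0}): 0.3, frozenset({1}): 0.7,
      frozenset({0, 1}): 1.0}
print([round(p, 6) for p in shapley(nu, 2)])   # -> [0.3, 0.7]
print(round(interaction(nu, 2, 0, 1), 6))      # -> 0.0
```

For the K = 7 tests of the case study, the outer sums range over at most 2^6 subsets per test, so the brute-force computation is entirely unproblematic.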
As FIP exhibits typical physical examination and clinical laboratory findings, it appears easy to diagnose. However, while a presumptive diagnosis is quickly established, a definite diagnosis is difficult to impossible to obtain without gross and histopathological evaluation including immunohistochemistry [1, 10]. The seroprevalence is high, especially in catteries, where up to 90 % of the cats are positive [2], but even up to 50 % of cats living in single-cat households have coronavirus-specific antibodies [4]. Of these, 5-10 % will develop the deadly form of FIP. A characteristic symptom of FIP is body cavity effusion, which also appears in other diseases [8]. Several treatment options exist for some of these diseases, while FIP is deadly and no reliably effective therapy is known so far [16]. Therefore, it is important to diagnose the correct disease early.

Several diagnostic tests for FIP are available, whose sensitivity, specificity, and positive and negative predictive values vary between different studies, presumably because different forms of FIP (effusive and dry) were investigated and because various clinical signs, geographic locations, years of investigation, prevalences and combinations of tests were used [4, 6, 7, 9, 15, 18]. In the studies so far, no cat had all available tests performed. The data underlying our study includes the following diagnostic tests:

- Albumin to globulin ratio, plasma (X_1) and effusion (X_2)
- Rivalta test (X_3)
- Presence of antibodies against feline coronavirus (FCoV, X_4)
- Reverse transcriptase nested polymerase chain reaction (RT-nPCR) to detect FCoV-RNA in EDTA-blood (X_5) and in the effusion (X_6)
- Immunofluorescence staining (IFA) of FCoV antigen in macrophages in the effusion (X_7)

Our dataset consists of 100 cats in total. For 29 of these cats, a necropsy was performed to establish the gold standard diagnosis; 11 of the 29 cats were diagnosed with feline infectious peritonitis (FIP).
Additionally, the above 7 diagnostic tests were performed on all cats (i.e., K = 7, L = 29 and U = 71).

To estimate the generalization accuracy (in terms of the simple 0/1 loss function) of each of the 2^7 = 128 combined diagnostic tests, we employ a semi-supervised classification technique called maximum contrastive pessimistic likelihood estimation (MCPL) [12]. Logistic regression with L_2 penalization is used as the base learner in MCPL, i.e., individual tests are combined using a linear model of the form (1). Estimates ê(A) of the (finite-sample) classification errors are obtained as follows: We resample the set of 29 labeled cats and split the resulting sample into 16 training and 13 test examples. The remaining 71 cats without label information are added to the training set. This procedure is repeated 501 times for each of the 128 combinations of tests, and the results are averaged. To obtain estimates of the ideal generalization performances e*(A), the finite-sample estimates are subsequently corrected using isotonic regression [3, 14] as described in Sect. 2.4.

To further illustrate the importance of the diagnostic test RT-nPCR, Fig. 2 shows the mean validated classification accuracy for all 128 test combinations. The 80 % empirical percentiles are indicated by the vertical lines, and the subsets are sorted in decreasing order of their mean validated accuracy. Moreover, the results for those subsets including RT-nPCR (measured in blood) are highlighted in blue. Evidently, the concentration of subsets containing RT-nPCR (blood) is systematically higher on the left side of the plot, which confirms that the inclusion of this test improves diagnostic accuracy.

The effect of isotonic regression on the finite-sample estimates is shown in Fig. 3. Here, each blue dot corresponds to an estimate ê(A) for a particular subset A of diagnostic tests.
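The repeated-splitting protocol just described can be sketched as follows. To keep the sketch self-contained, a trivial majority-class learner stands in for the MCPL model of the paper, and the data are synthetic placeholders for the 29 labeled and 71 unlabeled cats; only the splitting and averaging logic mirrors the text.

```python
import random

def estimate_error(labeled, unlabeled, train_fn, n_splits=501, n_train=16):
    """Average test error over repeated random splits of the labeled data
    (16 training / 13 test cases here); the unlabeled cases are always
    added to the training material, as in the semi-supervised setting."""
    errors = []
    for _ in range(n_splits):
        data = labeled[:]
        random.shuffle(data)
        train, test = data[:n_train], data[n_train:]
        model = train_fn(train, unlabeled)  # MCPL + logistic regression in the paper
        errors.append(sum(model(x) != y for x, y in test) / len(test))
    return sum(errors) / len(errors)

def majority_learner(train, unlabeled):
    """Placeholder for the semi-supervised learner: always predicts the
    majority class of the labeled training cases (ignores the features)."""
    pos = sum(1 for _, y in train if y == 1)
    label = 1 if pos >= len(train) - pos else -1
    return lambda x: label

# Synthetic stand-in for the study data: 29 labeled (11 positive) and
# 71 unlabeled cases, each with a single hypothetical test outcome.
random.seed(0)
labeled = [((random.random(),), 1 if i < 11 else -1) for i in range(29)]
unlabeled = [(random.random(),) for _ in range(71)]
err = estimate_error(labeled, unlabeled, majority_learner)
print(0.0 <= err <= 1.0)  # -> True
```

In the study itself, this loop is run once per test combination A, with the feature vectors restricted to the tests in A, yielding the 128 estimates ê(A).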
Since monotonicity with respect to the partial order, which is assured by isotonic regression, cannot be visualized in a two-dimensional plot, the data points are sorted by their corrected classification accuracy (and ties are broken at random). The green line shows the isotonic regression fit.

The corrected performance estimates ν*(A) can subsequently be used to calculate the Shapley values for each diagnostic test. The results are shown in Fig. 4. Due to the monotonicity of ν*, all values are now positive. Again, the RT-nPCR tests achieve the highest Shapley values, but the FCoV antibody titer and IFA (effusion) obtain values > 0.15, too. Note that the relative order of the RT-nPCR tests has changed compared to Fig. 1, probably because their accuracies are very similar and the bootstrap validation is random in nature.

Figure 5 shows the accuracy estimates for all subsets. The dots indicate the corrected accuracies ν*(A) and are used to sort the subsets in decreasing order. As before (cf. Fig. 2), the subsets containing RT-nPCR (blood) can mostly be found on the left side of the plot; this trend is now even more pronounced.

An important question for a veterinary physician is which combination A of tests to perform, taking into account both diagnostic accuracy and effort. Figure 6 shows the corrected accuracies ν*(A) (green dots) of all subsets of tests and their combined monetary cost in Euro. The Pareto set, consisting of those combinations that are not outperformed by any other combination in terms of both accuracy and cost at the same time, is indicated as a blue line. From a practical point of view, the result suggests using a single diagnostic test, namely RT-nPCR (blood or effusion), because the inclusion of more tests yields only minor improvements. This is confirmed by the pairwise interaction indices shown for both measures ν• and ν* in Table 1. All of these indices are negative, suggesting that the tests are more redundant than complementary.
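Extracting such a Pareto set from (accuracy, cost) pairs is simple to implement. In the sketch below, the accuracies and Euro costs are hypothetical placeholders, not the study's actual figures; only the dominance criterion matches the definition in the text.

```python
def pareto_set(options):
    """Return the names of options not dominated by any other option.
    'Dominating' means: at least as accurate AND at most as costly,
    with at least one of the two strictly better."""
    result = []
    for name, acc, cost in options:
        dominated = any(
            a >= acc and c <= cost and (a > acc or c < cost)
            for n, a, c in options if n != name)
        if not dominated:
            result.append(name)
    return result

# Hypothetical (accuracy, cost in Euro) pairs for a few test combinations:
tests = [("RT-nPCR (blood)",    0.86, 40.0),
         ("RT-nPCR (effusion)", 0.87, 45.0),
         ("Rivalta",            0.75,  5.0),
         ("Rivalta + IFA",      0.80, 60.0)]
print(pareto_set(tests))  # -> ['RT-nPCR (blood)', 'RT-nPCR (effusion)', 'Rivalta']
```

Here "Rivalta + IFA" is dominated by "RT-nPCR (effusion)" (more accurate and cheaper in this toy data), so it drops out of the Pareto set, while a cheap but less accurate test like "Rivalta" remains Pareto-optimal.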
Note that, once a decision in favor of using a single test has been made, the Shapley value, as a measure of the average improvement achieved by adding a test, is no longer the best indicator of the usefulness of a test. Instead, a selection should be made based on the tests' individual performance. With a validated accuracy of 87 %, RT-nPCR (effusion) appears to be the best choice in this regard.

In this paper, we proposed a method for measuring the importance and usefulness of predictor variables in (semi-)supervised machine learning, which makes use of concepts from cooperative game theory: subsets of variables are considered as coalitions, and their predictive performance plays the role of the payoff. Although our approach is motivated by a concrete application in veterinary medicine, namely the diagnosis of feline infectious peritonitis in cats, it is completely general and can obviously be used for other learning problems as well. For the case study just mentioned, our method produces results that appear plausible and agree with the medical experts' experience. Roughly speaking, there are two strong diagnostic tests that are significantly more accurate than the others; practically, it suffices to use one of them, since a combination with other tests yields only minor improvements.

There are several directions for future work. For example, the principle we proposed in Sect. 2.4 for inducing ideal generalization performances e*(A) from finite-sample estimates ê(A) is clearly plausible and, moreover, seems to be able to calibrate the original estimates thanks to an ensemble effect. Nevertheless, it calls for a more thorough analysis and theoretical justification.
References

[1] Recommendations from workshops of the second international feline coronavirus/feline infectious peritonitis symposium
[2] Prevalence of feline coronavirus types I and II in cats with histopathologically verified feline infectious peritonitis
[3] Structure algorithms for partially ordered isotonic regression
[4] Performances of different diagnostic tests for feline infectious peritonitis in challenging clinical cases
[5] Fundamentals of Uncertainty Calculi with Applications to Fuzzy Inference
[6] Comparison of different tests to diagnose feline infectious peritonitis
[7] Using direct immunofluorescence to detect coronaviruses in peritoneal and pleural effusions
[8] Sensitivity and specificity of cytologic evaluation in the diagnosis of neoplasia in body fluids from dogs and cats
[9] Positive predictive value of albumin:globulin ratio for feline infectious peritonitis in a mid-western referral hospital population
[10] A comparison of lymphatic tissues from cats with spontaneous feline infectious peritonitis (FIP), cats with FIP virus infection but no FIP, and cats with no infection
[11] Analysis of regression in game theory approach
[12] Contrastive pessimistic likelihood estimation for semi-supervised classification
[13] Techniques for reading fuzzy measures (III): interaction index
[14] Algorithms for a class of isotonic regression problems
[15] Using direct immunofluorescence to detect coronaviruses in peritoneal and pleural effusions
[16] Effect of feline interferon-omega on the survival time and quality of life of cats with feline infectious peritonitis
[17] A value for n-person games
[18] Detection of ascitic feline coronavirus RNA from cats with clinically suspected feline infectious peritonitis