key: cord-0675404-ziyobdrt authors: Hand, David J.; Christen, Peter; Kirielle, Nishadi title: F*: An Interpretable Transformation of the F-measure date: 2020-07-31 journal: nan DOI: nan sha: 60477e5100b607f759b7ad5aa8eb83cde6b10ed1 doc_id: 675404 cord_uid: ziyobdrt The F-measure is widely used to assess the performance of classification algorithms. However, some researchers find it lacking in intuitive interpretation, questioning the appropriateness of combining two aspects of performance as conceptually distinct as precision and recall, and also questioning whether the harmonic mean is the best way to combine them. To ease this concern, we describe a simple transformation of the F-measure, which we call F* (F-star), which has an immediate practical interpretation. Many different measures have been used to evaluate the performance of classification algorithms (see, for example, Demšar, 2006; Ferri et al., 2009; Hand, 2012; Powers, 2011; Sokolova and Lapalme, 2009 ). Such evaluation is central to choosing between algorithms -to decide which is the best to use in practice, to decide if a method is "good enough", to optimise parameters (equivalent to choosing between methods), and for other reasons. The data on which such assessments are based is normally a test set (independent of the training data) consisting of a score and an associated true class label for each object. Here we consider the two class case, with labels 0 and 1. Objects are assigned to class 1 if their score exceeds some threshold t, and to class 0 otherwise. This reduces the data for the evaluation measure to a two-by-two table, the confusion matrix, with counts as shown in Table 1 . In general, such a table has four degrees of freedom. Normally, however, the total number of test set cases, n = a + b + c + d, will be known, as will the relative proportions belonging to each of the two classes (also sometimes called the priors, or the prevalence in medical applications). This reduces the problem to just two degrees of freedom, which must be combined in some way in order to yield a numerical measure on a univariate continuum which can be used to compare classifiers. The choice of the two degrees of freedom and the way of combining them can be made in various ways. In particular, the columns and rows of the table yield proportions which can then be combined (using the known relative class sizes). These proportions go under various names, including, recall or sensitivity, d/(d + b); precision or positive predictive value, d/(c+d); specificity, a/(a+c); and negative predictive value, a/(a + b). These simple proportions can be combined to yield familiar performance measures, including the misclassification rate, the kappa statistic, the Youden index, the Matthews coefficient, and the F-measure. Another class of measures acknowledges that the value of the classification threshold t which is to be used in practice may not be known at the time that the algorithm has to be evaluated (and the time at which a choice between algorithms has to be made), so that they average over a distribution of possible values. Such measures include the Area Under the Receiver Operating Characteristic Curve (AUC) and the H-measure (Hand, 2009; Hand and Anagnostopoulos, 2014) . 
We should remark that these names are not always used consistently, and that particular measures go under different names (an example being the equivalence of recall and sensitivity above). This is a consequence of the widespread application of the ideas, which arise in many different domains.

Many of the performance measures have straightforward intuitive interpretations. For example:

• the misclassification rate is simply the proportion of objects in the test set which are incorrectly classified;
• the kappa statistic is the chance-adjusted proportion correctly classified;
• the AUC is the probability that a randomly chosen class 0 object will have a score lower than a randomly chosen class 1 object; and
• the H-measure is the fraction by which the classifier reduces the expected minimum misclassification loss, compared with that of a random classifier.

The F-measure, which is particularly widely used in computational disciplines, also has a simple interpretation: it is the harmonic mean of the two confusion matrix degrees of freedom, precision P = d/(c+d) and recall R = d/(b+d):

F = 2PR/(P + R) = 2/((1/P) + (1/R)).

Since precision and recall are both important and in a sense complementary aspects of performance, it seems reasonable to combine them into a single measure. But averaging them may not be so palatable: it might be regarded as analogous to adding apples and oranges. Moreover, despite the seminal work of Van Rijsbergen (1979), some researchers are uneasy about the use of the harmonic mean (Hand and Christen, 2018), preferring other forms of average (e.g. an arithmetic or geometric mean) which are arguably more immediately interpreted. The desire for an interpretable perspective is illustrated in, for example, Stack Exchange (2013).

In an attempt to tackle this unease, in what follows we present a transformed version of the F-measure which has a straightforward intuitive interpretation.

2. The F-measure and F*

Plugging the counts from Table 1 into the definition of F, we obtain

F = 2d/(2d + b + c).

So if we define F′ as F′ = F/(2(1 − F)) = d/(b + c), we have that F′ is the number of class 1 objects correctly classified for each object misclassified. This is a straightforward and attractive interpretation of a transformation of the F-measure, and some researchers might prefer to use it. However, F′ is a ratio and not simply a proportion, so it is not constrained to lie between 0 and 1, as most other performance measures are. We can overcome this by a further transformation, yielding

F* = F′/(1 + F′) = d/(b + c + d).

Now, defining F* (F-star) as F* = F/(2 − F), we have that F* is the ratio of the number of correctly classified class 1 objects to the number of objects which are not correctly classified class 0 objects. Put another way, F* is the number of correctly classified class 1 objects expressed as a proportion of the number of objects which are either class 1, classified as class 1, or both. Or, yet a third alternative, F* is the number of correctly classified class 1 objects expressed as a fraction of the number of objects which are either misclassified or are correctly classified class 1 objects. That is, F* = d/(n − a), which can be calculated immediately from the confusion matrix.

To illustrate, if class 1 objects are relevant documents in information retrieval, then F* is the number of relevant documents retrieved expressed as a proportion of all documents except the non-retrieved irrelevant ones.
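To make the algebra above concrete, here is a short Python sketch (ours, not from the paper; the example counts are arbitrary) that computes F, F′ and F* from the counts of Table 1 and checks numerically that F′ = d/(b + c) and F* = d/(n − a).

```python
# Illustrative sketch (not from the paper): F, F' and F* from the counts of
# Table 1, with numerical checks of the identities derived above:
# F = 2d/(2d + b + c),  F' = F/(2(1 - F)) = d/(b + c),  F* = F/(2 - F) = d/(n - a).

def f_measures(a, b, c, d):
    precision = d / (c + d)
    recall = d / (b + d)
    f = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
    f_prime = f / (2 * (1 - f))  # correctly classified class 1 objects per misclassification
    f_star = f / (2 - f)         # the same information, rescaled to lie in [0, 1]
    return f, f_prime, f_star

# Example confusion matrix: a = 80, b = 5, c = 15, d = 100, so n = 200.
a, b, c, d = 80, 5, 15, 100
n = a + b + c + d
f, f_prime, f_star = f_measures(a, b, c, d)
assert abs(f_prime - d / (b + c)) < 1e-9  # F' = d/(b + c)
assert abs(f_star - d / (n - a)) < 1e-9   # F* = d/(n - a)
print(round(f, 4), round(f_prime, 4), round(f_star, 4))  # 0.9091 5.0 0.8333
```

Note that F′ is undefined when F = 1 (no misclassifications at all), consistent with its interpretation as the number of correct class 1 classifications per error; F* has no such problem, taking the value 1 in that case.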
As another illustration, if class 1 objects are COVID-19 infections, then F* is the number of infected people who test positive divided by the number who either test positive or are infected or both.

Since F* is monotonically related to F, any conclusions drawn from F* will clearly be identical to those drawn from F. In particular, choices between algorithms will be the same.

Van Rijsbergen (1979) also defines a weighted version of F, placing different degrees of importance on precision and recall. This carries over immediately to yield weighted versions of both F′ and F*.

The overriding concern when choosing a measure of performance in supervised classification problems should be to match the measure to the objective. Different measures have different properties, emphasising different aspects of classification algorithm performance. A poor choice of measure can lead to the adoption of an inappropriate classification algorithm, in turn leading to suboptimal decisions and actions.

A distinguishing characteristic of the F-measure is that it makes no use of the a count in the confusion matrix, the number of class 0 objects correctly classified as class 0. This can be appropriate in certain information retrieval tasks (for which the measure was originally developed), where there is a potentially large and unknown number of possible matches (a then corresponds to irrelevant documents that are not retrieved). In other contexts, however, such as medical diagnosis, correct classification to each of the classes can be important.

The F-measure uses the harmonic mean to combine precision and recall, two distinct aspects of classification algorithm performance, and some researchers question the use of this form of mean and the interpretability of their combination. We have shown, however, that suitable transformations of F have straightforward intuitive interpretations.

References

Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research.
Ferri, C., Hernández-Orallo, J. and Modroiu, R. (2009). An experimental comparison of performance measures for classification. Pattern Recognition Letters.
Hand, D.J. (2009). Measuring classifier performance: A coherent alternative to the area under the ROC curve. Machine Learning.
Hand, D.J. (2012). Assessing the performance of classification methods. International Statistical Review.
Hand, D.J. and Anagnostopoulos, C. (2014). A better Beta for the H-measure of classification performance. Pattern Recognition Letters.
Hand, D.J. and Christen, P. (2018). A note on using the F-measure for evaluating record linkage algorithms. Statistics and Computing.
Powers, D.M.W. (2011). Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies.
Sokolova, M. and Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing and Management.
Van Rijsbergen, C.J. (1979). Information Retrieval. Butterworths.