Weight Vector Tuning and Asymptotic Analysis of Binary Linear Classifiers
Lama B. Niyazi, Abla Kammoun, Hayssam Dahrouj, Mohamed-Slim Alouini, and Tareq Al-Naffouri
October 1, 2021

Abstract: Unlike its intercept, a linear classifier's weight vector cannot be tuned by a simple grid search. Hence, this paper proposes weight vector tuning of a generic binary linear classifier through the parameterization of a decomposition of the discriminant by a scalar which controls the trade-off between conflicting informative and noisy terms. By varying this parameter, the original weight vector is modified in a meaningful way. Applying this method to a number of linear classifiers under a variety of data dimensionality and sample size settings reveals that the classification performance loss due to non-optimal native hyperparameters can be compensated for by weight vector tuning. This yields computational savings as the proposed tuning method reduces to tuning a scalar compared to tuning the native hyperparameter, which may involve repeated weight vector generation along with its burden of optimization, dimensionality reduction, etc., depending on the classifier. It is also found that weight vector tuning significantly improves the performance of Linear Discriminant Analysis (LDA) under high estimation noise. Proceeding from this second finding, an asymptotic study of the misclassification probability of the parameterized LDA classifier in the growth regime where the data dimensionality and sample size are comparable is conducted. Using random matrix theory, the misclassification probability is shown to converge to a quantity that is a function of the true statistics of the data. Additionally, an estimator of the misclassification probability is derived.
Finally, computationally efficient tuning of the parameter using this estimator is demonstrated on real data. A binary linear classifier classifies a data point to one class or the other by thresholding a discriminant that is a linear combination of the data features. The weights of the features make up a weight vector and the constant term in the discriminant is the bias of the classifier. Despite the availability of sophisticated non-linear methods for classification, linear classifiers are still widely used. In fact, new variants of standard linear methods catering to specific settings and applications are being developed all the time. A search of the recent literature reveals that linear classifiers are being employed in many tasks including clinical neuroimaging [1] , digital pulse shape discrimination [2] , predicting the genetic merit of beef cattle [3] , and in conjunction with other methods for applications such as pathogen identification [4] , strategy representation [5] , and cancer classification [6] . Linear classifiers are especially suited to certain high-dimensional datasets on which they perform comparably with non-linear classifiers, with the advantage of much faster training times and quicker classification [7] . Due to ease of computation, linear classifiers further make good trial classifiers during the initial exploratory phase, when the relationship between the data features and labels is yet unknown [8] . One way of improving a given linear classifier's performance on a particular dataset is by tuning its bias so as to minimize training error on that dataset [9] . Because the bias is a scalar, a grid search for the optimum is computationally undemanding. Even the need for a grid-search can be eliminated in many cases for which explicit representations of the optimal bias can be derived. 
For example, the authors of [10] derive an explicit bias correction of the Linear Discriminant Analysis (LDA) classifier discriminant in order to improve classification in the high estimation noise regime. The authors of [11] similarly correct the bias of this classifier in explicit form, but in the context of cost-sensitive classification. Additionally, references [12] and [13] provide explicit bias corrections for certain high-dimensional variants of LDA. A related question concerns improving a linear classifier's weight vector, which cannot be tuned or corrected in the same way. Relying on the intuition that a good weight vector should extract the maximum discriminatory information content from the data point being classified, we show in this work that tuning the multidimensional weight vector can indeed be reduced to tuning a scalar. In the first half of this paper, it is shown that any binary linear classifier discriminant can be decomposed into terms containing discriminating information and non-discriminating noise. A linear form of this decomposition, parameterized by a variable α, controls the trade-off between conflicting noise and information terms. At the optimal setting of α, the modified discriminant performs at least as well as the original classifier from which it was produced. Following this, the effect of the weight vector modification on the performance of an assortment of linear classifiers under different data dimensionality and sample size scenarios is studied. The method yields especially significant performance gains for the LDA classifier under high estimation noise. Interestingly, the parameterized LDA operates as a bridge between LDA and the nearest centroid classifier, and performs at least as well as either of these classifiers.
Additionally, it is shown that tuning the weight vector according to the proposed method can significantly improve the performance of certain classifiers whose native hyperparameters are not optimally set. It is shown that with weight vector tuning, the Support Vector Machine (SVM) with non-optimally tuned penalty can achieve performance close to that of its tuned counterpart. In this case, tuning the weight vector is fundamentally different from tuning the native hyperparameter of the classifier as it occurs post weight vector generation, while the native hyperparameter tuning occurs prior to weight vector generation. For SVM, generating the weight vector for each value of the native hyperparameter involves solving an optimization problem. Tuning the weight vector according to the proposed method, however, reduces to a simple grid search over a scalar parameter. This idea can be generalized to any classifier with hyperparameters that are set prior to weight vector generation. The second half of the paper consists of an asymptotic study of the parameterized LDA classifier under a growth regime in which the data dimensionality and sample size grow proportionally. We use random matrix theory to show that the probability of misclassification of this classifier converges to a limit that is a function of the true class statistics. We also derive a consistent estimator of the probability of misclassification by which the classifier parameter α can be tuned. This estimator is more computationally efficient than other tuning methods which rely on additional testing points or recycling the training set, e.g. cross-validation, as it requires no additional testing points and no averaging. We demonstrate its performance on real data. An additional finding of this work is a new interpretation of the optimality of LDA. 
The LDA decision rule, derived by maximizing the posterior probability of a test point under the assumption that it is drawn from a Gaussian distribution whose classes have distinct means and a common covariance, yields a weight vector in the optimal Bayes direction. It can be shown that, under a common class covariance, the weight vector resulting from Fisher's linear discriminant, which maximizes the ratio of the distance between the projected class means to the within-class variance, is proportional to the Bayes direction [9]. In the binary case, a proportional solution can also be arrived at via a least squares fit of the data to their labels [9]. This makes the Bayes direction optimal in the posterior probability sense, the Fisher's linear discriminant sense, and the least squares sense. Moreover, this paper shows that the Bayes direction is optimal in the sense that it achieves the minimum noise (in the mean square error sense) with respect to the test point when the classes are Gaussian with common covariance. To summarize, the main contributions of this paper are
• A practical method for weight vector tuning which reduces to a grid search over a scalar parameter
• A novel interpretation of the optimality of the LDA classifier in terms of minimizing test point noise
• Asymptotic expressions for the probability of misclassification of the parameterized LDA classifier
• A consistent estimator of the probability of misclassification of the parameterized LDA classifier
Throughout the paper, scalars are denoted by plain lower-case letters, vectors by bold lower-case letters, and matrices by bold upper-case letters. The symbol I_p represents the p × p identity matrix, the symbol 1_p represents the all-ones p × 1 vector, and the symbol 0_p represents the all-zeros p × 1 vector. The notation || · || symbolizes the Euclidean norm when its argument is a vector and the spectral norm when its argument is a matrix.
The operator ⌈·⌉ rounds its argument up to the nearest integer. Almost-sure convergence is denoted by a.s.→, and the notion of a deterministic equivalent of a sequence of random square matrices follows [14].

Consider a supervised classification problem in which a test point x ∈ R^p is to be labeled as belonging to one of two classes C_0 and C_1. A linear classification approach to this problem imposes a discriminant of the form

W(x) = w^T x + w_0, (1)

characterized by a weight vector w ∈ R^p and a bias w_0 ∈ R. The decision rule then classifies x to one of the two classes according to the sign of (1), i.e., C(x) = i indicates that x is classified to class C_i, i = 0, 1. Examples of classifiers which fit this form include LDA, SVM and Least-Squares SVM (both using linear kernels), and Regularized LDA (R-LDA). In this paper, we propose a method of tuning the weight vector w which reduces the non-discriminative 'noisy' components of the original discriminant (1). As a result, the modified discriminant achieves a testing error rate at least as good as the original and, in certain cases, much better. Throughout this paper, let the means and covariances of classes C_0 and C_1 be denoted by µ_0, Σ_0 and µ_1, Σ_1, respectively. In Section II-A, we explore an ideal case in which the discriminant neatly decomposes into separate information and noise terms and the noises cancel out optimally in a linear fashion, under the assumption of perfectly known means and that C_0 and C_1 make up a Gaussian mixture model with common class covariance Σ_0 = Σ_1 = Σ. Inspired by the findings of Section II-A, in Section II-B we heuristically extend this result to a more practical scenario which assumes unknown means and no restriction on the class distributions.

In this section, assume that the data distribution means µ_0 and µ_1 are known exactly and that Σ_0 = Σ_1 = Σ. We proceed to derive a noise-minimized version of (1). Consider the shifted test point x̃ = x − (µ_0 + µ_1)/2.
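As a purely illustrative sketch (the numbers below are hypothetical, not from the paper), the generic discriminant and its sign-based decision rule can be written as:

```python
import numpy as np

def discriminant(x, w, w0):
    """Generic binary linear discriminant W(x) = w^T x + w0."""
    return w @ x + w0

def classify(x, w, w0):
    """Decision rule: assign x to C1 if the discriminant is positive, else to C0."""
    return 1 if discriminant(x, w, w0) > 0 else 0

# Hypothetical weights and test point in p = 3 dimensions.
w = np.array([1.0, -2.0, 0.5])
w0 = 0.25
x = np.array([0.4, 0.1, 1.0])
label = classify(x, w, w0)  # discriminant = 0.4 - 0.2 + 0.5 + 0.25 = 0.95 > 0
```

Any classifier of this form (LDA, linear SVM, R-LDA, etc.) differs only in how w and w_0 are produced from the training data.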
For any given classifier with weight vector w, we show that the projection of x̃ onto w, i.e., w^T x̃, can be decomposed into 'informative' components which aid in discriminating the class of x and 'noisy' components which interfere with discriminating the class of x. We then take advantage of this hidden structure for the purpose of reducing the overall noise and obtaining a better classifier. Let µ = µ_1 − µ_0. The point x̃ can be expressed as the sum of its projection onto µ and its projection orthogonal to µ as

x̃ = (µµ^T/µ^Tµ) x̃ + P_µ x̃, (2)

where P_µ = I_p − µµ^T/µ^Tµ, so that P_µ x̃ is the projection orthogonal to µ. Substituting (2) into w^T x̃ results in the decomposition

w^T x̃ = (w^Tµ/µ^Tµ) µ^T x̃ + w^T P_µ x̃. (3)

We now show that the first term in (3) is composed of an informative component and a noisy component with respect to x, while the second term consists solely of noise. Assume x ∈ C_i, where i is either 0 or 1. Then, under the Gaussian mixture model with i-th class prior π_i,

x ∼ N(µ_i, Σ) with probability π_i, i = 0, 1. (4)

Writing x = µ_i + Σ^{1/2} z with z ∼ N(0, I_p), the first term of (3) is then distributed as follows

(w^Tµ/µ^Tµ) µ^T x̃ = (−1)^{i+1} (w^Tµ)/2 + (w^Tµ/µ^Tµ) µ^T Σ^{1/2} z. (5)

The first term in (5) carries information about the class of x through its sign. The second term is the same regardless of the class of x and therefore carries no discriminating information. This is a direct result of assuming a common covariance between C_0 and C_1. The informative component is denoted by I_1, while the noisy component with respect to x is denoted by N_1. The second term of (3) is w^T P_µ x̃ = w^T P_µ Σ^{1/2} z, since P_µ µ = 0. The discriminatory component of this term is lost in the orthogonal projection, and therefore this term consists solely of noise with respect to the testing point, denoted by N_2. In the interest of achieving better classification performance, we wish to reduce the overall noise content in the discriminant. To this end, consider the following modification of the discriminant (3), for any function g(·),

(w^Tµ/µ^Tµ) µ^T x̃ + g(w^T P_µ x̃), (6)

which, by the above analysis, is equivalent to I_1 + N_1 + g(N_2). The optimal g(·) such that E[(N_1 + g(N_2))^2] is minimized is g*(N_2) = E[−N_1 | N_2]. This choice of g(·) has the effect of minimizing the total noise in the discriminant in the mean square error sense.
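The orthogonal decomposition of w^T x̃ into a component along µ and a component orthogonal to µ can be checked numerically. The following is a sketch with randomly generated quantities, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5
mu0, mu1 = rng.normal(size=p), rng.normal(size=p)
w, x = rng.normal(size=p), rng.normal(size=p)

mu = mu1 - mu0                                    # difference of class means
x_tilde = x - (mu0 + mu1) / 2                     # shifted test point
P_mu = np.eye(p) - np.outer(mu, mu) / (mu @ mu)   # projector orthogonal to mu

term_info = (w @ mu) * (mu @ x_tilde) / (mu @ mu)  # component along mu
term_orth = w @ P_mu @ x_tilde                     # component orthogonal to mu

# The two components sum back to the original projection,
# and the orthogonal residual carries no component along mu.
assert np.isclose(term_info + term_orth, w @ x_tilde)
assert np.allclose(P_mu @ mu, 0.0)
```

The first term is the one whose sign carries the class information under the common-covariance Gaussian model; the second is pure noise in that model.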
In the following Lemma 1, we derive the exact form of g*(·) for a given w based on the class distribution assumptions (4).

Lemma 1. The optimal g*(N_2) is the linear function of N_2 given by

g*(N_2) = −[(w^Tµ)(µ^T Σ P_µ w) / ((µ^Tµ)(w^T P_µ Σ P_µ w))] N_2. (7)

Proof: Given w, N_1 and N_2 are zero-mean jointly Gaussian random variables. Thus, the optimal g*(N_2) = E[−N_1 | N_2] reduces to the linear function g*(N_2) = −(E[N_1 N_2]/E[N_2^2]) N_2, which evaluates to (7).

Note that N_2 is observable only through the expression w^T P_µ x̃, and so when using this result we replace N_2 by its observable counterpart. Based on this result, we have the following theorem.

Theorem 1. The discriminant that minimizes the noise with respect to the test point in the MSE sense for a given w, known means, and under the data distribution assumptions of (4) is

(w^Tµ/µ^Tµ) µ^T x̃ − [(w^Tµ)(µ^T Σ P_µ w) / ((µ^Tµ)(w^T P_µ Σ P_µ w))] w^T P_µ x̃, (8)

or, equivalently, w̃^T x + w̃_0 with w̃ = (w^Tµ/µ^Tµ)µ + α_MMSE(w) P_µ w and w̃_0 = −w̃^T(µ_0 + µ_1)/2, where α_MMSE(w) denotes the coefficient of w^T P_µ x̃ in (8). This result is obtained by simply evaluating (6) using g*(·). We make several remarks concerning this result. Firstly, the modified discriminant is linear. This is a direct result of the Gaussian assumption (4), which, while not technically necessary, is desirable, as it produces a simple linear form which inspires the parameterized formulation presented in the next section. Secondly, the original weight vector w is modified to w̃ and a bias w̃_0 is generated. This bias is the optimal bias in the sense of minimizing the probability of misclassification under the class distribution assumptions of (4) and equal class priors when fixing the weight vector to w̃ (see [15], Proposition 2). Finally, viewing the modified discriminant (8) as a function of a parameter α,

(w^Tµ/µ^Tµ) µ^T x̃ + α w^T P_µ x̃, (9)

the setting α = α_MMSE(w) yields a stationary point of its probability of misclassification and achieves the minimum probability of misclassification when w^Tµ > 0. This is demonstrated in Section II-A1. The following corollary of Theorem 1 lends intuition as well as credibility to this technique by showing that it recovers the Bayes optimal classifier discriminant for the assumed class distributions from its weight vector. The Bayes classifier in this case is linear.
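Under the known-statistics Gaussian model, the MMSE setting of α and the parameterized discriminant can be sketched as below. The closed form used for α_MMSE is reconstructed from the jointly Gaussian noise terms (α_MMSE = −E[N_1 N_2]/E[N_2^2]); treat it as a sketch under those assumptions rather than a verbatim transcription of the paper's equations:

```python
import numpy as np

def alpha_mmse(w, mu0, mu1, Sigma):
    """MMSE alpha = -E[N1*N2]/E[N2^2] for the jointly Gaussian noise terms
    N1 = (w'mu/mu'mu) mu' Sigma^{1/2} z and N2 = w' P_mu Sigma^{1/2} z."""
    mu = mu1 - mu0
    P = np.eye(len(mu)) - np.outer(mu, mu) / (mu @ mu)
    cov_n1n2 = (w @ mu) / (mu @ mu) * (mu @ Sigma @ P @ w)
    var_n2 = w @ P @ Sigma @ P @ w
    return -cov_n1n2 / var_n2

def alpha_discriminant(x, w, mu0, mu1, alpha):
    """Parameterized discriminant: informative projection along mu
    plus alpha times the orthogonal component."""
    mu = mu1 - mu0
    P = np.eye(len(mu)) - np.outer(mu, mu) / (mu @ mu)
    x_tilde = x - (mu0 + mu1) / 2
    return (w @ mu) / (mu @ mu) * (mu @ x_tilde) + alpha * (w @ P @ x_tilde)
```

Setting alpha = 1 recovers the original projection w^T x̃ exactly, while alpha = alpha_mmse(...) minimizes the variance of the total noise N_1 + αN_2.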
It is the LDA classifier, with decision rule based on the discriminant

µ^T Σ^{-1} (x − (µ_0 + µ_1)/2) + ln(π_1/π_0). (10)

The LDA weight vector is w = Σ^{-1}µ. Since there is no modification of the weight vector, we conclude that the LDA weight vector (in the case of known statistics) is optimal relative to itself in that it achieves the minimum noise (in the mean square error sense) with respect to the test point under the assumed class distributions.

1) Experiments with Known Means: For the following simulation and any simulations involving synthetic data in the remainder of this paper, the exact expected testing error/probability of misclassification of a linear classifier learned on a given training set is computed using knowledge of the data distribution from which the testing data is generated. All synthetic data in this paper is generated from a two-class Gaussian mixture model. The expected testing error under these data distribution assumptions of a generic binary linear classifier with weight vector β and intercept β_0 can easily be derived as (see Lemma 1 in [16])

ε = π_0 Φ((β^T µ_0 + β_0)/√(β^T Σ_0 β)) + π_1 Φ(−(β^T µ_1 + β_0)/√(β^T Σ_1 β)), (11)

where Φ(·) denotes the standard normal CDF. Now consider the parameterized version (9) of (8). The objective of the following simulation is to show that α_MMSE(w) given by (7) coincides with the α yielding a stationary point of the expected testing error of (9). The stationary point is a minimum when w^Tµ > 0 and is otherwise a maximum, as in that case the orientation of w flips the class labels. To demonstrate this, a weight vector w is uniformly sampled from all w such that ||w||^2 = 1 using the method in [17]. It is then fed into (9), and the exact expected testing error with varying α is plotted using (11). The quantity α_MMSE(w) is then computed from (7) for comparison. The class statistics used for this simulation are given in (12) and (13), with π_0 = π_1 = 0.5. Figures 1a and 1b show the results when w^Tµ > 0 and w^Tµ < 0, respectively. In Figure 1a, the minimum expected testing error occurs at α = 0.15341. This exactly coincides with the α_MMSE(w) of Theorem 1 that minimizes the noise in the discriminant.
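The closed-form expected testing error (11) can be evaluated directly. Below is a sketch (with Φ implemented via the error function) under the two-class Gaussian mixture assumption:

```python
import numpy as np
from math import erf, sqrt

def Phi(t):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def expected_error(beta, beta0, mu0, mu1, Sigma0, Sigma1, pi0, pi1):
    """Exact expected testing error of the linear classifier beta^T x + beta0
    under a two-class Gaussian mixture."""
    s0 = np.sqrt(beta @ Sigma0 @ beta)  # discriminant std. dev. given C0
    s1 = np.sqrt(beta @ Sigma1 @ beta)  # discriminant std. dev. given C1
    return (pi0 * Phi((beta @ mu0 + beta0) / s0)
            + pi1 * Phi(-(beta @ mu1 + beta0) / s1))
```

A standard sanity check: for the Bayes direction β = Σ^{-1}µ with common covariance, equal priors, and the midpoint bias, this reduces to Φ(−Δ/2), where Δ^2 = µ^TΣ^{-1}µ.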
In Figure 1b, the maximum expected testing error occurs at α = −0.12029, which, again, exactly coincides with the α_MMSE(w) that minimizes the noise in the discriminant. The latter discriminant's behavior can be explained by the fact that the orientation of the randomly generated w flips the class labels. Simply taking the negative of w yields a classifier having the minimum expected testing error at α_MMSE(w). In conclusion, minimizing the noise in the discriminant in the MSE sense is equivalent to minimizing the expected testing error, as long as w is sensibly oriented. This motivates using this criterion as the basis for designing a better classifier in the next section.

The previous section derives the discriminant with minimum noise with respect to the test point for a general binary linear classifier with weight vector w under the assumption of Gaussian classes with known means and a common covariance. A more practical scenario is when all class statistics are unknown and sample statistics are used instead. Using the sample mean estimates introduces an additional estimation noise into the discriminant. Let the n individual training vectors corresponding to classes C_0 and C_1 make up the columns of the matrices X_0 ∈ R^{p×n_0} and X_1 ∈ R^{p×n_1}, respectively (n = n_0 + n_1). The maximum likelihood estimates of the class means are given by the sample means μ̂_0 = (1/n_0) X_0 1_{n_0} and μ̂_1 = (1/n_1) X_1 1_{n_1}. Let μ̂ = μ̂_1 − μ̂_0 and redefine the shifted test point as x̃ = x − (μ̂_0 + μ̂_1)/2. Given a weight vector w, w^T x̃ can be expressed as

w^T x̃ = (w^Tμ̂/μ̂^Tμ̂) μ̂^T x̃ + w^T P_μ̂ x̃, (14)

where P_μ̂ = I_p − μ̂μ̂^T/μ̂^Tμ̂. Regardless of the class distributions, and whether assuming distinct covariances Σ_0 and Σ_1 or a common class covariance Σ_0 = Σ_1 = Σ, following a similar line of logic to the analysis in Section II-A reveals that, while the first term in (14) is similarly composed of both information and noise (whether that be estimation noise, noise from the test point, or both), the second term is not purely noise. In fact, it is informative. This is shown in detail in Appendix A.
Thus, when the means are unknown, the approach taken in Section II-A of minimizing the mean squared sum of N_1 and the second term no longer applies, as the second term is informative. Nonetheless, the interaction of this term with the noise in the first term can potentially yield performance gains, and so, motivated by Section II-A, the following parameterized version of the sample-statistic equivalent of (8) is proposed:

(w^Tμ̂/μ̂^Tμ̂) μ̂^T x̃ + α w^T P_μ̂ x̃, (15)

where α is a parameter to be tuned. The following Section II-B1 demonstrates that a better misclassification rate may be achieved by setting α to a value that is not equal to one (where α = 1 recovers the original projection with optimal bias, assuming equal priors and the class distribution in (4)). A significant improvement is observed when the estimation noise is high. In this section we explore the behavior of (15) under a variety of settings and for an assortment of starting weight vectors. We first list and briefly describe the discriminants from which these weight vectors are extracted, namely LDA, logistic regression, linear support vector machine (SVM), regularized LDA (R-LDA), and randomly-projected LDA ensemble (RP-LDA).
• LDA (see [9]) in the form (10) is the Bayes classifier for data distributed as (4). In practice, the class statistics are unknown and sample estimates are used instead. The sample means μ̂_0 and μ̂_1 are defined at the beginning of Section II-B. The maximum likelihood estimates of the common covariance matrix and class priors are the pooled sample covariance matrix Σ̂ and the sample proportions π̂_i = n_i/n, i = 0, 1. Its weight vector is w = Σ̂^{-1}μ̂.
• For linearly separable training data, SVM with linear kernel (see [9]) finds a hyperplane that maximizes the margin between one class and the other subject to constraints of perfect classification on the training points. When the training data is linearly inseparable, the constraints are relaxed by penalizing each (possibly) misclassified point. The penalty is a parameter that must be tuned.
This variant is called the soft-margin SVM with linear kernel, and it is what we use in this paper.
• Logistic regression (see [9]) models the log-odds ln(P[x ∈ C_1 | x]/(1 − P[x ∈ C_1 | x])) as a linear function of the test point. The decision boundary corresponds to the set of points at which the log-odds equals zero. The weight vector and bias of the decision boundary are learned by maximizing the likelihood of the training data.
• R-LDA counters the small-sample issue in LDA by regularizing the pooled sample covariance estimate before inverting it. There are several possibilities for the form of the regularization (see [18]); in this paper we opt for a form with a single regularization parameter γ that must be tuned, and the weight vector replaces the inverse Σ̂^{-1} in the LDA weight vector with the corresponding regularized inverse.
• RP-LDA ensemble (see [19]) counters the small-sample issue in LDA by reducing the dimensionality of the training samples (and test point) using random matrices. Each projection R_i ∈ R^{d×p} yields a discriminant. These are averaged over all M projections to form the final discriminant. The reduced dimension d is a parameter that must be tuned.
For these simulations, we consider two data distributions: data generated from classes having a common covariance and data generated from classes having distinct covariance matrices. We also consider three regimes of n versus p: n on the order of p (p = 400, n = 450), n > p (p = 10, n = 500), and n < p (p = 300, n = 100). We apply the appropriate classifiers to each regime. LDA requires n > p; soft-margin SVM is applicable in any regime; logistic regression requires n to be much greater than p to ensure convergence of the maximum likelihood estimates of the weight vector and bias; and, finally, R-LDA and RP-LDA are designed for the regime n < p. Each classifier is trained on a generated training set. Additionally, for SVM, R-LDA, and RP-LDA, the penalty, γ, and d parameters are chosen to minimize the expected testing error given that training set.
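The RP-LDA ensemble discriminant described above can be sketched as follows. The exact averaged form (in particular the midpoint shift and the use of the projected pooled covariance) is an assumption modeled on the description in the text, not a transcription of [19]:

```python
import numpy as np

def rp_lda_discriminant(x, X0, X1, d, M, rng):
    """Sketch of an RP-LDA ensemble discriminant: average of M LDA
    discriminants computed in d-dimensional random projections."""
    p, n0 = X0.shape
    n1 = X1.shape[1]
    mu0, mu1 = X0.mean(axis=1), X1.mean(axis=1)
    mu_hat = mu1 - mu0
    # Pooled sample covariance (n - 2 normalization).
    S = ((X0 - mu0[:, None]) @ (X0 - mu0[:, None]).T
         + (X1 - mu1[:, None]) @ (X1 - mu1[:, None]).T) / (n0 + n1 - 2)
    x_tilde = x - (mu0 + mu1) / 2
    total = 0.0
    for _ in range(M):
        R = rng.normal(size=(d, p))   # random projection to d dimensions
        S_d = R @ S @ R.T             # projected pooled covariance (invertible for small d)
        total += (R @ mu_hat) @ np.linalg.solve(S_d, R @ x_tilde)
    return total / M
```

Note that every candidate value of d requires re-projecting all of the data, which is precisely what makes tuning d costly compared to tuning a scalar applied after the weight vector is fixed.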
The SVM penalty is tuned within the set {10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 100, 1000}, γ within the set [10^{-4}, 2] in increments of 0.1, and d from 1 to the maximum allowable setting of d = rank(Σ̂) − 2 in increments of 2. After this is done, we have a weight vector w for each classifier. Each weight vector is fed into (15) to obtain an α-parameterized version of the discriminant. Let us refer to these new classifiers as α-LDA, α-SVM, α-log, α-RLDA, and α-RPLDA for short. For each α-parameterized discriminant, we vary α and compute the expected testing error using (11). These errors are averaged over 100 independently generated training sets. Error bars depicting the standard errors are plotted alongside this average. Recall that setting α = 1 in (15) produces a discriminant having the original weight vector w and a bias with minimum probability of misclassification (under the Gaussian mixture model and equal priors assumption) for that weight vector. In what follows, we use α = 1 as a reference point for determining whether or not there is a significant improvement in classifier performance at the α achieving the minimum error rate. To quantify the improvement, we report percentage changes relative to the average expected testing error at α = 1, computed as (error at the α achieving the minimum − error at α = 1)/(error at α = 1) × 100. This quantity reflects the fact that a given error improvement starting from an already low error rate at the baseline α = 1 is more significant than the same improvement when the starting error is high. The first set of class statistics we consider are (12), (13), and π_0 = π_1 = 0.5. The corresponding figures plot the average expected testing error against varying α when p = 300 and n = 100; here, the relative decreases in error do not exceed 0.6%. As described at the beginning of this section, for each training set, the SVM penalty is tuned to the value yielding the lowest expected testing error. We found that SVM does not show much improvement when it is α-parameterized.
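In code, tuning the proposed scalar reduces to the following grid search. This sketch scores candidate α values by training error on the available samples (rather than the exact expected error (11), which requires the true statistics); the function names are illustrative:

```python
import numpy as np

def alpha_disc(x, w, mu0_hat, mu1_hat, alpha):
    """Sample-statistic alpha-parameterized discriminant."""
    mu_hat = mu1_hat - mu0_hat
    P = np.eye(len(mu_hat)) - np.outer(mu_hat, mu_hat) / (mu_hat @ mu_hat)
    x_tilde = x - (mu0_hat + mu1_hat) / 2
    return (w @ mu_hat) / (mu_hat @ mu_hat) * (mu_hat @ x_tilde) + alpha * (w @ P @ x_tilde)

def tune_alpha(w, X0, X1, grid):
    """Return the alpha in `grid` achieving the lowest training error."""
    mu0_hat, mu1_hat = X0.mean(axis=1), X1.mean(axis=1)
    n = X0.shape[1] + X1.shape[1]
    best_alpha, best_err = None, np.inf
    for a in grid:
        wrong0 = sum(alpha_disc(x, w, mu0_hat, mu1_hat, a) > 0 for x in X0.T)
        wrong1 = sum(alpha_disc(x, w, mu0_hat, mu1_hat, a) <= 0 for x in X1.T)
        err = (wrong0 + wrong1) / n
        if err < best_err:
            best_alpha, best_err = a, err
    return best_alpha, best_err
```

Because each candidate α reuses the same weight vector w, the search costs one pass over the training set per grid point, with no re-training of the underlying classifier.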
It is interesting to observe what happens when the penalty is not tuned beforehand. Instead, we set the penalty to 1 (its default setting in the MATLAB R2019b 'fitcsvm' function) uniformly across all training sets. Figure 7 shows the resulting average expected testing error of α-SVM plotted against varying α in the same setting as in Figure 5b, i.e., p = 450, n = 400, and distinct Σ_0 and Σ_1. In this case, α-SVM achieves a relative decrease in error of 17.7% at α = 0.2. Clearly, the method improves performance when w itself is not at its optimal. Taking this idea further, we show that tuning the weight vector of an SVM classifier with a poorly chosen penalty can compensate for the resulting loss in performance. Figure 8 shows this on two digit pairs from the USPS dataset. For the digit pair '2' and '6', an optimized SVM classifier can achieve a testing error of 0.0217. Figure 8a shows the testing error of α-SVM starting with a poorly tuned SVM classifier whose testing error on this digit pair is 0.0489. By weight vector tuning, the testing error can be brought down to 0.0272. This is comparable to the performance of the original optimized SVM classifier. Similarly, for the digit pair '3' and '5', an optimized SVM classifier can achieve a testing error of 0.0675. Figure 8b shows the testing error of α-SVM starting with a poorly tuned SVM classifier whose testing error on this digit pair is 0.0951. By weight vector tuning, the testing error can be brought down to 0.0736. The significance of this finding is the potential savings in computation offered by weight vector tuning versus penalty tuning. The reason is that weight vector tuning is an afterthought; it occurs post weight vector generation. Setting the penalty, on the other hand, is done prior to weight vector generation: an optimization problem must be solved to generate the weight vector for each setting of the penalty. This idea generalizes to any linear classifier whose native hyperparameters are set prior to weight vector generation.
The tuning of the hyperparameters will then involve repeatedly generating the weight vector. If this process is costly, weight vector tuning can provide a more computationally efficient method of improving performance than tuning the native hyperparameters. Another example, not demonstrated here, is the RP-LDA ensemble classifier, whose projection dimension d is a native hyperparameter. Tuning d is computationally inefficient, as it means projecting all the data with each setting of d; a simple alternative is weight vector tuning. Overall, we conclude from this section that α-LDA in the 'n on the order of p' scenario shows the most promise in terms of improved performance. For this reason, we proceed to study this classifier in the RMT asymptotic regime in the next section. As suggested by its name, the nearest centroid classifier classifies x to the class with the nearest sample mean; it is the Bayes classifier for data distributed as (4) when Σ = I_p. As the previous section has shown, α-LDA exhibits the greatest improvement in performance among the sampled classifiers, particularly when the data dimensionality p is on the order of the number of samples n. This can be attributed to the fact that the LDA weight vector is an explicit function of the sample statistics. Due to estimation noise, there is much to be gained in this regime. We thus pursue an asymptotic study of α-LDA in the growth regime where n and p grow at constant rates relative to each other. Under this growth regime, we derive an asymptotic expression and an estimator for the probability of misclassification of α-LDA. In this section, we first show that, under the growth regime assumptions (a)–(e), the first of which is (a) 0 < lim inf p/n ≤ lim sup p/n < 1, and considering the training set to be random, the probability of misclassification of the α-LDA classifier converges to a quantity that is a function of only the true statistics. This quantity is referred to as the deterministic equivalent of the probability of misclassification.
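The bridge between α-LDA and the nearest centroid classifier is easy to verify numerically: at α = 0, the discriminant (15) depends on x only through μ̂^T(x − (μ̂_0 + μ̂_1)/2), which has the same sign as the nearest centroid rule whenever w^Tμ̂ > 0. A sketch with synthetic data (an illustration, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 8, 50
X0 = rng.normal(size=(p, n))
X1 = rng.normal(size=(p, n)) + 0.5
mu0_hat, mu1_hat = X0.mean(axis=1), X1.mean(axis=1)
mu_hat = mu1_hat - mu0_hat
P = np.eye(p) - np.outer(mu_hat, mu_hat) / (mu_hat @ mu_hat)

w = mu_hat + 0.1 * rng.normal(size=p)  # any starting weight vector...
if w @ mu_hat < 0:                     # ...oriented so that w' mu_hat > 0
    w = -w

def alpha_disc(x, alpha):
    x_tilde = x - (mu0_hat + mu1_hat) / 2
    return (w @ mu_hat) / (mu_hat @ mu_hat) * (mu_hat @ x_tilde) + alpha * (w @ P @ x_tilde)

def centroid_disc(x):
    # Positive iff x is closer to mu1_hat than to mu0_hat.
    return np.sum((x - mu0_hat) ** 2) - np.sum((x - mu1_hat) ** 2)

# At alpha = 0 the two decision rules agree on every test point.
for x in rng.normal(size=(20, p)):
    assert (alpha_disc(x, 0.0) > 0) == (centroid_disc(x) > 0)
```

The agreement follows from the identity ||x − μ̂_0||^2 − ||x − μ̂_1||^2 = 2 μ̂^T(x − (μ̂_0 + μ̂_1)/2), so α-LDA interpolates between the nearest centroid rule (α = 0) and the LDA-style projection (α = 1).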
The deterministic equivalent approximates the random realization of the probability of misclassification and can be useful for understanding the behavior of the classifier with synthetic data, for which the statistics are perfectly known. In practice, however, the statistics are unknown. For this reason, we also derive an estimator of the probability of misclassification which is consistent under the same growth assumptions. This is referred to as a G-estimator of the probability of misclassification and can be used to tune α. To proceed with these derivations, we first require an expression for the expected probability of misclassification. Assuming the classes C_0 and C_1 are Gaussian with means and covariances µ_0, Σ_0 and µ_1, Σ_1 respectively, the probability of misclassification of a test point x by the α-LDA classifier has the form

ε = π_0 Φ(m_0/σ_0) + π_1 Φ(−m_1/σ_1),

where m_0, m_1, σ_0^2, and σ_1^2 are the discriminant means and variances conditioned on x ∈ C_0 and x ∈ C_1, respectively. Define ρ = μ̂^T Σ̂^{-1} μ̂ / μ̂^T μ̂. In the following sections, we present the DEs and G-estimators for both the general case of distinct covariances and the special case of common covariances. A deterministic equivalent ε̄ of ε satisfies ε − ε̄ → 0 almost surely under the growth regime assumptions (a)–(e) (see Lemma 2 in [20] for proof). Thus, the deterministic equivalent ε̄ is itself a function of deterministic equivalents m̄_0, m̄_1, σ̄_0^2, and σ̄_1^2, which are also functions of only the true statistics. In the following theorem, we state the expressions of m̄_0, m̄_1, σ̄_0^2, and σ̄_1^2 which are used to compute ε̄. This is followed by a corollary corresponding to the special case of common covariances, in which [Ω]_{1j} = (n_{j−1} − 1)/(n − 2). Note in these statements that, while n − 2 is asymptotically equivalent to n, we retain the n − 2 in these expressions for increased accuracy in finite dimensions.
The quantities δ̃ and ν̃ appearing in the preceding deterministic equivalent expressions are the solutions of a fixed-point iteration. An estimator ε̂ of ε, in turn, is consistent if ε − ε̂ → 0 almost surely under the growth regime assumptions (a)–(e). The following theorem states the expressions of the G-estimators m̂_0, m̂_1, σ̂_0^2, and σ̂_1^2 which are used to compute ε̂. This is followed by a corollary specific to the case in which Σ_0 = Σ_1 = Σ is assumed.

Theorem 3 (Distinct-covariance G-estimators). The G-estimators m̂_0, m̂_1, σ̂_0^2, and σ̂_1^2 satisfy (17) under the growth regime assumptions (a)–(e). Proof: See Appendix C-A.

Corollary (Common-covariance G-estimators). When Σ_0 = Σ_1 = Σ is assumed, the G-estimators m̂_0, m̂_1, σ̂_0^2, and σ̂_1^2 likewise satisfy (17) under the growth regime assumptions (a)–(e).

Notice that ε̂ is a function of the sample statistics alone. It estimates the probability of misclassification without the need for additional testing data, and it is much more computationally efficient than the cross-validation procedure. In the next section, we show how to use ε̂ for the purpose of tuning the α parameter. In this section, α-LDA is applied to real data. The objective is to show how α-LDA performs compared to LDA and the nearest centroid classifier on real data, as well as to demonstrate the use of the G-estimator ε̂ in tuning the α parameter. We consider binary classification of digit pairs from the USPS dataset [21] and phoneme pairs from the dataset of [22]. For each problem, we train and test LDA, nearest centroid, and α-LDA on the relevant dataset. The empirical errors are plotted against varying α. Also plotted is the G-estimator ε̂ of the error of α-LDA. Figure 9 shows the results on two digit pairs from the USPS dataset. As mentioned in Section II-B1, this dataset consists of grayscale images of handwritten digits 0–9 encoded as 256-dimensional vectors. For Figure 9a, we use the digit pair '2' and '6'. Overall, there are n = 1395 total training vectors and 368 total testing vectors corresponding to this digit pair. The figure shows that LDA achieves the lowest empirical error on this digit pair.
This performance is matched by α-LDA.

² Note that for these particular datasets, the two G-estimators almost match. Of the two, the G-estimator which assumes common covariances is plotted.

Figure 10 considers the phoneme dataset, from which the phonemes 'aa' and 'ao' (which are similar in pronunciation) were extracted in order to construct this binary classification problem. As the dataset is not pre-divided into training and testing sets, the split was performed randomly. We take advantage of this to construct a classification problem in which n is not much greater than p. A training set consisting of 400 samples is randomly extracted from the full set of 'aa' and 'ao' phonemes, preserving the class proportions. This leaves 1317 samples for testing. Based on the simulations from the previous section, we expect to observe a much greater performance gain in this scenario than in Figure 9. Figure 10 shows that, as expected, α-LDA significantly outperforms LDA, with an error of 0.224 for the former compared to 0.3083 for the latter: a 27.3% decrease in error, achieved at α = 0.525. In this case, the data seems to lean towards an isotropic covariance structure, as nearest centroid performs better than LDA. Even so, α = 0 is not optimal. Thus, α-LDA provides the best balance between these two classifiers. Lastly, the G-estimator points towards a setting of α = 0.4. Using this setting incurs an increase in error of just 0.0023 relative to the optimal setting.

In this work, we design a method of weight vector tuning for binary linear classifiers based on the decomposition of the discriminant into informative and noisy components. The tuning takes the form of a linear parameterization of the decomposition. Deriving this method reveals a novel interpretation of the classic LDA classifier weight vector as minimizing the noise from the test point to which it is applied.
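The phoneme experiment above draws a 400-sample training set "according to the same proportions" from a 1717-sample pool, leaving 1317 test samples. A minimal sketch of such a proportional (stratified) split follows; only the 1717-sample total and the 400/1317 division come from the text, while the per-class counts and labels are illustrative stand-ins since the dataset itself is not loaded here.

```python
import numpy as np

rng = np.random.default_rng(2)

# stand-in labels for the two phoneme classes ('aa' = 0, 'ao' = 1);
# the per-class sizes are hypothetical, only the 1717 total is from the text
y = np.r_[np.zeros(695, dtype=int), np.ones(1022, dtype=int)]
n_train = 400

train_idx = []
for c in (0, 1):
    idx = np.flatnonzero(y == c)
    # per-class quota proportional to that class's frequency in the pool
    k = round(n_train * len(idx) / len(y))
    train_idx.extend(rng.choice(idx, size=k, replace=False))
train_idx = np.array(train_idx)
test_idx = np.setdiff1d(np.arange(len(y)), train_idx)
```

Sampling each class without replacement keeps the training class proportions within rounding of the pool's, so the prior probabilities seen at training time match those at test time.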
We simulate the performance gain of this method for a variety of linear classifiers (LDA, SVM, logistic regression, R-LDA, and the RP-LDA ensemble) under different data dimensionality and sample size settings. First, we find that weight vector tuning can compensate for performance loss due to poorly chosen native classifier hyperparameters, thus eliminating the need for native hyperparameter tuning. As weight vector tuning occurs after weight vector generation, this can be advantageous in terms of computational efficiency when the native hyperparameters need to be set prior to weight vector generation. Second, we find that the parameterization significantly improves the performance of LDA under high estimation noise. We proceed to derive the misclassification probability of the parameterized LDA classifier in the RMT growth regime corresponding to these settings, in which the data dimensionality and sample size grow at rates comparable to each other. We also provide an estimator of the probability of misclassification which neither relies on additional data samples nor requires intensive computations, and which can thus be used to tune the parameter of this classifier in a computationally efficient manner.

APPENDIX A

As in the derivation of Section II-A, assume x ∈ C_i, where i is either 0 or 1, and assume the two classes have a common covariance matrix Σ. Then x ∼ μ_i + Σ^{1/2} z, where z ∼ N(0, I). Using the fact that μ̂_i = μ_i + (1/n_i) Σ^{1/2} Z_i 1_{n_i} for some Z_i with N(0, I) columns, i = 0, 1, x can be expressed in terms of these Gaussian factors. The first term in (14) can then be rewritten accordingly. Note that the noise here is due to both the common covariance between the classes (this is the test point noise) and estimation noise from the sample means. Similarly, the second term can be expressed using (18). Alternatively, expressing (18) in terms of μ̂ rather than μ, so that the orthogonal projection can be put to use, yields a similar result.
By using the fact that μ_i = μ̂_i − (1/n_i) Σ^{1/2} Z_i 1_{n_i}, x can be re-expressed, and likewise the second term in (14). From this perspective, the information in the test point combines with the sample estimation noise in the term (1/n_i) Σ^{1/2} Z_i 1_{n_i}. Note that even if we were to have equal sample sizes, so that n_0 = n_1, we would still be able to discriminate the class of the test point through Z_i. This is not immediately obvious, as Z_i, i = 0, 1, both have the same distribution, but it can be observed asymptotically by computing the deterministic equivalents.

An analogous result to that of the previous section can be derived in the case when the class covariance matrices are distinct. In this case, x ∼ μ_i + Σ_i^{1/2} z, where z ∼ N(0, I). Using the fact that μ̂_i = μ_i + (1/n_i) Σ_i^{1/2} Z_i 1_{n_i} for some Z_i with N(0, I) columns, i = 0, 1, the first term in (14) expands analogously. Note here that the noise is due only to estimation noise from the sample means, since the differing covariances of the two classes are informative. Substituting (19) directly into the second term in (14) gives a corresponding expansion. Alternatively, expressing (19) in terms of μ̂ reveals the second term in (14) to be purely information, using the fact that μ_i = μ̂_i minus the sample-mean estimation noise.

APPENDIX B

Deriving ε̄ reduces to deriving the deterministic equivalents m̄_0, m̄_1, σ̄²_0, and σ̄²_1, which in turn reduces to deriving the deterministic equivalents of their constituent quadratic forms. This is the approach taken in what follows. The following proofs rely heavily on three main facts. First, under the distinct covariance assumption on the class distributions, μ̂_i = μ_i + (1/n_i) Σ_i^{1/2} Z_i 1_{n_i} for some Z_i with N(0, I) columns, i = 0, 1. This follows from expressing the data matrices X_i, i = 0, 1, as X_i = μ_i 1_{n_i}^T + Σ_i^{1/2} Z_i. Second, the sample means μ̂_0 and μ̂_1 are independent of the sample covariance Σ̂. This can be shown by simply plugging this expression into the corresponding formula for Σ̂_i.
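The first of the three facts above, μ̂_i = μ_i + (1/n_i) Σ^{1/2} Z_i 1_{n_i}, says in particular that the sample mean is distributed as N(μ, Σ/n). This is easy to verify numerically: across repetitions, the empirical covariance of the sample means should approximate Σ/n. Dimensions and Σ below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, reps = 4, 50, 20_000

# an arbitrary illustrative covariance matrix
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p + np.eye(p)
mu = np.ones(p)

L = np.linalg.cholesky(Sigma)  # plays the role of Sigma^{1/2}
means = np.empty((reps, p))
for r in range(reps):
    Z = rng.standard_normal((p, n))   # Z has N(0, I) columns
    X = mu[:, None] + L @ Z           # n samples from N(mu, Sigma)
    means[r] = X.mean(axis=1)         # equals mu + (1/n) L Z 1_n

emp_cov = np.cov(means.T)             # should approximate Sigma / n
```

The shrinking Σ/n covariance is exactly the "estimation noise from the sample means" that the appendix tracks separately from the test-point noise.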
This yields an expression for Σ̂_i in which the randomness enters only through Z_i (I_{n_i} − (1/n_i) 1_{n_i} 1_{n_i}^T). Since the terms Z_i 1_{n_i} and Z_i (I_{n_i} − (1/n_i) 1_{n_i} 1_{n_i}^T) are jointly Gaussian and uncorrelated, they are independent, and hence so are μ̂_i and Σ̂_i. Lastly, Σ̂ can be expressed in terms of matrices Z̃_0 ∈ R^{p×(n_0−1)} and Z̃_1 ∈ R^{p×(n_1−1)}, both having columns distributed as N(0_p, I_p). Using (20), eigendecompose the centering projections I_{n_i} − (1/n_i) 1_{n_i} 1_{n_i}^T = U_i diag(0, 1, …, 1) U_i^T, where U_0 and U_1 have as their first columns the vectors 1_{n_0}/√n_0 and 1_{n_1}/√n_1, respectively. By using these same bases to eigendecompose I_{n_0} and I_{n_1} in (21), we obtain the stated form, where Z̃_0 ∈ R^{p×(n_0−1)} is the submatrix of Z_0 obtained by removing its first column and Z̃_1 ∈ R^{p×(n_1−1)} is the submatrix of Z_1 obtained by removing its first column. Now we are ready to derive the deterministic equivalents.

1) Derivation of m̄_0: The discriminant mean m_0 can be expressed as a sum of quadratic forms. Thus, the problem of deriving this deterministic equivalent can be further decomposed into deriving the following convergence statements. The first two convergence statements are derived by using the fact that μ̂_i = μ_i + (1/n_i) Σ_i^{1/2} Z_i 1_{n_i} [23]. The third and fourth terms involve Σ̂. Since μ̂ and Σ̂ are independent, the convergence can be split into stages. For the third term we first have an intermediate convergence result, and for the fourth term another intermediate convergence result, each obtained by dealing with μ̂ as described above, independently of Σ̂. Next, we express Σ̂ = WW^T, where W ∈ R^{p×(n−2)}. The Moore-Osgood theorem allows the interchange of these limits. It is enough to show that the sequences μ̂^T Q_γ μ̂ and μ̂^T Q_γ (μ_0 − (μ̂_0 + μ̂_1)/2) converge uniformly. Since these sequences converge pointwise (this follows from convergence in probability), this can be shown by uniformly bounding their first derivatives [24]. The required bound follows from the result in [25], which shows that λ_min{Σ̂} > C for some constant C almost surely. Using the growth regime assumption (c), it can be shown that ‖μ̂‖²₂ is bounded. This completes the proof. The other term can be handled in a similar way. We can now apply the result in [14], which manifests in equations (22), (23), (24), (25), and (26).
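The column-removal step above rests on a spectral fact about the centering projection C = I_n − (1/n) 1_n 1_n^T: it has a single zero eigenvalue along 1_n/√n and eigenvalue one on the orthogonal complement, so Z C Z^T has the distribution of Z̃ Z̃^T with one fewer Gaussian column. A quick numeric check (illustrative sizes, with p > n so the rank deficiency is visible):

```python
import numpy as np

n = 10
ones = np.ones((n, 1))
C = np.eye(n) - ones @ ones.T / n          # centering projection

vals, U = np.linalg.eigh(C)                # eigenvalues ascending: one 0, then n-1 ones

# the projection drops exactly one Gaussian column: Z C Z^T has rank n - 1
rng = np.random.default_rng(4)
Z = rng.standard_normal((12, n))           # p = 12 > n, so rank n - 1 < p is visible
G = Z @ C @ Z.T
rank = np.linalg.matrix_rank(G)
```

The eigenvector for the zero eigenvalue is (up to sign) 1_n/√n, matching the first columns of U_0 and U_1 in the text.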
We can then take the limit as γ → 0. Combining (23) and (24) gives one equation; combining (25) and (26) gives another. These equations show that δ(γ) and ν(γ) vary as 1/γ, and so they diverge as γ → 0. Combining (24), (23), and (25) yields an expression for δ̃(γ); combining (26), (23), and (25) yields an expression for ν̃(γ). This pair of equations poses no problems as γ → 0, and we therefore work with δ̃(γ) and ν̃(γ). Taking the limit as γ → 0 in (27) and (28) yields the limiting pair of equations. Although there are no closed-form solutions for δ̃(0) and ν̃(0), these equations fit under the framework of a standard inference problem (see Definition 6.2 of [26]). The fixed-point iteration algorithm stated in Theorem 2 is guaranteed to converge to a unique solution (δ̃(0), ν̃(0)).

2) Derivation of m̄_1: Similarly, the problem of deriving this deterministic equivalent can be further decomposed into deriving additional convergence statements, which can be proven in a similar way to the terms composing m_0.

3) Derivation of σ̄²_0: The discriminant variance σ²_0 can be expressed as a sum of quadratic forms, so the problem of deriving this deterministic equivalent can be further decomposed into deriving additional convergence statements. The first two results can be shown using the same techniques as above. The third result needs special treatment, as it involves a double resolvent; it follows by using the result for double resolvents in [14] in conjunction with taking γ → 0.

4) Derivation of σ̄²_1: Similarly, the problem of deriving this deterministic equivalent can be further decomposed into deriving additional convergence statements. The third convergence statement uses the corresponding result for double resolvents.

The following proofs rely heavily on three main facts. First, under the assumption that the two classes have common covariance Σ, μ̂_i = μ_i + (1/n_i) Σ^{1/2} Z_i 1_{n_i} for some Z_i with N(0, I) columns, i = 0, 1. Second, μ̂_i, i = 0, 1, are independent of Σ̂. Finally, Σ̂ can be expressed as Σ̂ = (1/(n−2)) Σ^{1/2} Z Z^T Σ^{1/2} for some Z ∈ R^{p×(n−2)} which has i.i.d. entries distributed as N(0, 1).
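The fixed-point iteration invoked above (Theorem 2) is the standard way such self-consistent deterministic-equivalent equations are solved. As a generic illustration of the scheme (on the classical Marchenko-Pastur equation for Σ = I, not on equations (27)-(28) themselves, which are not reproduced here), one can iterate the scalar equation for the Stieltjes transform and check it against a simulated sample covariance:

```python
import numpy as np

# For Sigma = I, the Stieltjes transform m(z) of the sample covariance
# spectrum satisfies the self-consistent Marchenko-Pastur equation
#   m = 1 / (1 - c - c*z*m - z),   c = p/n.
# Solve it by plain fixed-point iteration at a real point z = -1.
c, z = 0.5, -1.0
m = 0.0
for _ in range(200):
    m = 1.0 / (1.0 - c - c * z * m - z)

# empirical check: (1/p) tr (S - zI)^{-1} on one large realization
rng = np.random.default_rng(5)
p, n = 500, 1000                  # so p/n = c
Z = rng.standard_normal((p, n))
S = Z @ Z.T / n                   # sample covariance with Sigma = I
m_emp = float(np.mean(1.0 / (np.linalg.eigvalsh(S) - z)))
```

For this contractive map the iteration converges from any nonnegative starting point; the paper's δ̃, ν̃ system is a coupled two-dimensional analogue of the same idea.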
The proofs follow the same line of reasoning as those at the beginning of Section B-A.

1) Derivation of m̂_0: The problem of deriving this G-estimator can be further decomposed into deriving the following convergence statements. We now derive the second convergence statement in detail, as it is mostly representative of the rest of the derivations. The term μ̂^T Σ̂^{−1} μ̂ can be expanded using the Gaussian factors above, which have N(0, 1) entries. Taking the expectation over Z_i 1_{n_i}, i = 0, 1, while making use of the fact that Σ̂ is independent of μ̂ and that (1/n_i) Z_i 1_{n_i} ∼ N(0_p, (1/n_i) I_p), i = 0, 1, we obtain an intermediate convergence result. We then work with W = V^T Z̃ ∈ R^{p×(n−2)}, which also has i.i.d. entries distributed as N(0, 1), due to the invariance of the Gaussian distribution to orthogonal transformations. Using the results in [27], the asymptotic limit follows. To be able to apply this asymptotic result, we first need to justify the interchange of the limits in (29). This can be done using the Moore-Osgood theorem, in a similar way to that shown in Section B-A1.

2) Derivation of m̂_1: Using the same approach as is used for deriving m̂_0, it can be shown that μ̂^T (μ̂_1 − (μ̂_0 + μ̂_1)/2) − (1/n_1) tr{Σ̂_1} and μ̂^T Σ̂^{−1} (μ̂_1 − (μ̂_0 + μ̂_1)/2) − ((n − 2)/n_1) λ_1 are consistent estimators of μ̂^T (μ_1 − (μ̂_0 + μ̂_1)/2) and μ̂^T Σ̂^{−1} (μ_1 − (μ̂_0 + μ̂_1)/2), respectively.

3) Derivation of σ̂²_0: Deriving the G-estimator for σ²_0 decomposes into deriving G-estimators of μ̂^T Σ_0 μ̂ and μ̂^T Σ_0 Σ̂^{−1} μ̂. We can easily show that μ̂^T Σ̂_0 μ̂ asymptotically matches μ̂^T Σ_0 μ̂, which takes care of the first term. We will now show a convergence result involving the factor 1 + (1/(n−2)) tr{Σ_0 Σ̂^{−1}}, from which the required G-estimator follows. The final expression is obtained by substituting the G-estimator of (1/(n−2)) tr{Σ_0 Σ̂^{−1}} derived previously.
In a similar way, the analogous result can be shown for the remaining term.

REFERENCES

[1] Chapter 5: Linear methods for classification
[2] Performance of linear classification algorithms on α/γ discrimination for LaBr3:Ce scintillation detectors with various pulse digitizer properties
[3] Linear classification scores in beef cattle as predictors of genetic merit for individual carcass primal cut yields
[4] Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study
[5] Strategy representation by decision trees with linear classifiers
[6] A novel gene selection algorithm for cancer classification using microarray datasets
[7] Recent advances of large-scale linear classification
[8] Pattern classification
[9] The elements of statistical learning
[10] On the dimension effect of regularized linear discriminant analysis
[11] Asymptotically bias-corrected regularized linear discriminant analysis for cost-sensitive binary classification
[12] Bias-corrected diagonal discriminant rules for high-dimensional classification
[13] High-dimensional linear discriminant analysis classifier for spiked covariance model
[14] Spectral analysis of the Gram matrix of mixture models
[15] A direct approach to sparse discriminant analysis in ultra-high dimensions
[16] Asymptotic analysis of an ensemble of randomly projected linear discriminants
[17] From MathWorld, A Wolfram Web Resource
[18] Regularized linear discriminant analysis and its application in microarrays
[19] Random projections as regularizers: learning a linear discriminant ensemble from fewer observations than dimensions
[20] Asymptotic analysis of an ensemble of randomly projected linear discriminants
[21] Handwritten zip code recognition with multilayer networks
[22] Penalized discriminant analysis
[23] Random matrix theory tutorial: introduction to deterministic equivalents
[24] Uniformly bounded derivative implies uniform convergence
[25] On the smallest eigenvalue of general correlated Gaussian matrices
[26] Random matrix methods for wireless communications
[27] On bilinear forms based on the resolvent of large random matrices
[28] Asymptotic analysis of RZF in large-scale MU-MIMO systems over Rician channels

Now define Q_γ = (WW^T − γI_p)^{−1}, γ < 0. According to [14], the deterministic equivalents of the associated quadratic forms are characterized by the scalars δ and ν, and the expressions we are working with can be written in this notation. The proofs are similar to those in Section B-B1.

3) Derivation of σ̂²_0 = σ̂²_1: The discriminant variance can be expressed as a sum of quadratic forms. The problem of deriving this deterministic equivalent can be reduced to deriving additional convergence statements. The last convergence claim involves a double resolvent, and therefore we include its derivation here. Using the same technique as before to remove the randomness coming from the sample means, we obtain an intermediate result. Using the result in [28] for double resolvents and by interchanging limits as before, we can show that the double resolvent introduces a correction factor of 1/(1 − p/n) (in addition to the 1/(1 − p/n) introduced by each of the sample covariance matrices). Substituting the G-estimator of (1/(n−2)) tr{Σ Σ̂^{−1}} and solving for the original quantity yields the stated expression.