QBoost: Predicting quantiles with boosting for regression and binary classification

Songfeng Zheng
Department of Mathematics, Missouri State University, 901 S. National Ave., Springfield, MO 65897, USA

Abstract

In the framework of functional gradient descent/ascent, this paper proposes Quantile Boost (QBoost) algorithms which predict quantiles of the response of interest for regression and binary classification. Quantile Boost Regression performs gradient descent in functional space to minimize the objective function used by quantile regression (QReg). In the classification scenario, the class label is defined via a hidden variable, and the quantiles of the class label are estimated by fitting the corresponding quantiles of the hidden variable. An equivalent form of the definition of quantile is introduced, whose smoothed version is employed as the objective function and then maximized by functional gradient ascent to obtain the Quantile Boost Classification algorithm. Extensive experimentation and detailed analysis show that QBoost performs better than the original QReg and other alternatives for regression and binary classification. Furthermore, QBoost is capable of solving problems in high dimensional space and is more robust to noisy predictors.

Keywords: Quantile regression; Boosting; Functional gradient algorithm; Binary classification

1. Introduction

Classical least square regression aims to estimate the conditional expectation of the response Y given the predictor (vector) x, i.e., E(Y|x). However, the mean value (or the conditional expectation) is sensitive to the outliers of the data (Koenker, 2005). Therefore, if the data is not homogeneously distributed, we expect the traditional least square regression to give us a poor prediction.

The $\tau$th quantile of a distribution is defined as the value such that there is $100\tau\%$ of mass on its left side. Compared to the mean value, quantiles are more robust to outliers (Koenker, 2005). Let $I(\cdot)$ be the indicator function with $I(\cdot) = 1$ if the condition is true, otherwise $I(\cdot) = 0$, and let $Q_\tau(Y)$ be the $\tau$th quantile of the random variable Y. It can be proved (Hunter & Lange, 2000) that

$$Q_\tau(Y) = \arg\min_{c} E_Y[\rho_\tau(Y - c)],$$

where $\rho_\tau(r)$ is the "check function" (Koenker, 2005) defined by

$$\rho_\tau(r) = r\,I(r \ge 0) - (1 - \tau)\,r. \quad (1)$$

Given training data $\{(x_i, Y_i),\ i = 1, \ldots, n\}$, with predictor vector $x_i \in \mathbb{R}^d$ and response $Y_i \in \mathbb{R}$, let the $\tau$th conditional quantile of Y given x be $f(x)$. Similar to least square regression, quantile regression (QReg) (Koenker, 2005; Koenker & Bassett, 1978) aims at estimating the conditional quantiles of the response given a predictor vector x and can be formulated as

$$f^*(\cdot) = \arg\min_{f} \frac{1}{n}\sum_{i=1}^{n} \rho_\tau\big(Y_i - f(x_i)\big). \quad (2)$$

Compared to least square regression, quantile regression is robust to outliers in observations and can give a more complete view of the relationship between predictor and response.
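To make the check function and the QReg objective of Eqs. (1) and (2) concrete, the following is a minimal numerical sketch (illustrative Python with NumPy; the function names and toy data are this sketch's own, not taken from the paper):

```python
import numpy as np

def check_loss(r, tau):
    """Check function rho_tau(r) = r * I(r >= 0) - (1 - tau) * r, Eq. (1)."""
    return r * (r >= 0) - (1.0 - tau) * r

def qreg_objective(f_vals, y, tau):
    """Empirical QReg objective of Eq. (2): average check loss of the residuals."""
    return np.mean(check_loss(y - f_vals, tau))

# Toy illustration: the 0.75 quantile of a sample minimizes the average check loss.
rng = np.random.default_rng(0)
y = rng.normal(size=1000)
cands = np.linspace(-3, 3, 601)
losses = [np.mean(check_loss(y - c, 0.75)) for c in cands]
print(cands[int(np.argmin(losses))], np.quantile(y, 0.75))  # the two agree closely
```

Minimizing the empirical check loss over a constant recovers the sample quantile, which is exactly the property that Eq. (2) exploits pointwise.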
Furthermore, least square regression implicitly assumes normally distributed errors, while such an assumption is not necessary in quantile regression. Since being introduced in Koenker and Bassett (1978), quantile regression has become a popular and effective approach to statistical analysis, with wide applications in economics (Hendricks & Koenker, 1992; Koenker & Hallock, 2001), survival analysis (Koenker & Geling, 2001), and ecology (Cade & Noon, 2003), to name a few.

The quantile regression model in Eq. (2) can be solved by linear programming algorithms (Koenker, 2005) or Majorize-Minimize algorithms (Hunter & Lange, 2000). However, when the predictor x lies in a high dimensional space, the aforementioned optimization methods for QReg might be inefficient. High dimensional problems are ubiquitous in applications such as image analysis, gene sequence analysis, etc. To the best of our knowledge, the problem of high dimensional predictors is not sufficiently addressed in the QReg literature, and this paper proposes a method for QReg which can work in high dimensional spaces.

The proposed algorithm for QReg is based on boosting (Freund & Schapire, 1997), which is well known for its simplicity and good performance. The powerful feature selection mechanism of boosting makes it suitable for working in high dimensional spaces. Friedman, Hastie, and Tibshirani (2000) developed a general statistical framework which yields a direct interpretation of boosting as a method for function estimation, namely a "stage-wise, additive model".

Consider the problem of function estimation

$$f^*(x) = \arg\min_{f} E[\,l(Y, f(x)) \mid x\,],$$

where $l(\cdot,\cdot)$ is a loss function which is typically differentiable and convex with respect to the second argument. Estimating $f^*(\cdot)$ from the given data $\{(x_i, Y_i),\ i = 1, \ldots, n\}$ can be performed by minimizing the empirical loss, $n^{-1}\sum_{i=1}^{n} l(Y_i, f(x_i))$, and pursuing iterative steepest descent in functional space. This leads us to the generic functional gradient descent algorithm (Friedman, 2001; Mason, Baxter, Bartlett, & Frean, 2000). Fig. 1 shows the version summarized in Bühlmann and Hothorn (2007).

Fig. 1. Generic functional gradient descent.

Many boosting algorithms can be understood as functional gradient descent with an appropriate loss function. For example, if we choose $l(Y, f) = \exp(-(2Y - 1)f)$, we recover AdaBoost (Friedman et al., 2000), and L2-Boost (Bühlmann & Yu, 2003) corresponds to $l(Y, f) = (Y - f)^2/2$.
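Since Fig. 1 itself is not reproduced in the text, the sketch below outlines the generic functional gradient descent loop in the form summarized by Bühlmann and Hothorn (2007); it is illustrative only, and `neg_gradient` and `fit_base` are placeholder callables for whichever loss and base procedure one plugs in:

```python
import numpy as np

def functional_gradient_descent(X, y, neg_gradient, fit_base, n_steps=100, step=0.1):
    """Generic functional gradient descent (cf. Fig. 1), a schematic sketch.

    neg_gradient(y, f_vals): pointwise negative gradient -dl(Y, f)/df evaluated at f_vals.
    fit_base(X, U): returns a callable g(.) fitted to U (e.g., by least squares).
    """
    init = np.mean(y)                 # step 1: constant offset (for the check loss, a
    f_vals = np.full(len(y), init)    #         natural choice is the sample tau-quantile)
    learners = []
    for _ in range(n_steps):
        U = neg_gradient(y, f_vals)   # step 2: evaluate the negative gradient
        g = fit_base(X, U)            # step 3: approximate it with the base procedure
        f_vals = f_vals + step * g(X) # step 4: take a small step in that direction
        learners.append(g)
    # Prediction at a new x: init + step * sum(g(x) for g in learners).
    return init, learners
```

Choosing the loss $l$ then specializes this skeleton to AdaBoost, L2-Boost, or, as developed next, to the quantile losses used by QBoost.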
Motivated by these gradient boosting algorithms (Friedman, 2001; Mason et al., 2000), this paper estimates the quantile regression function by minimizing the objective function in Eq. (2) with functional gradient descent. In each step, we approximate the negative gradient of the objective function by a base function and grow the model in that direction. This results in the Quantile Boost Regression (QBR) algorithm. In the binary classification scenario, we define the class label via a hidden variable, and the quantiles of the class label can then be estimated by fitting the corresponding quantiles of the hidden variable. An equivalent form of the definition of quantile is introduced, whose smoothed version is employed as the objective function for classification. Similar to QBR, functional gradient ascent is applied to maximize this objective function, which yields the Quantile Boost Classification (QBC) algorithm. The obtained Quantile Boost (QBoost) algorithms are computationally efficient and converge to local optima. More importantly, they enable us to solve high dimensional problems efficiently.

The rest of this paper is organized as follows. In Section 2, we first apply functional gradient descent to QReg, yielding the QBR algorithm, and then propose the approximation of the objective function for binary classification and maximize it with functional gradient ascent to obtain the QBC algorithm. Section 3 discusses some computational issues in the proposed QBR and QBC algorithms and introduces implementation details. Section 4 presents the experimental results of the proposed QBR and QBC algorithms on benchmark regression and binary classification datasets, together with in-depth discussions of the results. Finally, Section 5 summarizes the paper with a brief discussion of future research directions.

2. Methods

The methods proposed in our research are presented in this section. We first directly apply functional gradient descent to the quantile regression model, yielding the Quantile Boost Regression algorithm. We then propose a smooth approximation to the optimization problem for the quantiles of a binary response and, based on this, further propose the Quantile Boost Classification algorithm, with some discussion of related methods.

2.1. Quantile boost regression

We consider the problem of estimating the quantile regression function in the general framework of functional gradient descent with the loss function

$$l(Y, f) = \rho_\tau(Y - f) = (Y - f)\,I(Y - f \ge 0) - (1 - \tau)(Y - f).$$

A direct application of the algorithm in Fig. 1 yields the Quantile Boost Regression (QBR) algorithm, which is shown in Fig. 2.

Fig. 2. Quantile boost regression (QBR) algorithm.

Similar to AdaBoost, QBR enables us to select the most informative predictors if an appropriate base learner is employed, and this will be demonstrated experimentally in Section 4.1.1.

There is a large volume of literature applying boosting to regression problems, for example Duffy and Helmbold (2002), Freund and Schapire (1997), and Zemel and Pitassi (2001). However, all these methods estimate the mean value of the response, not quantiles. Langford, Oliveira, and Zadrozny (2006) proposed to use classification techniques for estimating the conditional quantile. For each given quantile value, their method trains a set of classifiers $\{c_t\}$ for a series of $t \in [0, 1]$, and the testing stage calculates the average of the outputs of the classifiers. Therefore, compared to the proposed QBR algorithm, this method is time consuming. Furthermore, it is not clear how to perform variable selection using the method in Langford et al. (2006). Takeuchi, Le, Sears, and Smola (2006) and Li, Liu, and Zhu (2007) perform quantile regression in Reproducing Kernel Hilbert Spaces (RKHS). Although an RKHS might be of very high dimensionality, the models in Takeuchi et al. (2006) and Li et al. (2007) require quadratic programming, which is computationally more expensive than the proposed QBR model. Moreover, the models in Takeuchi et al. (2006) and Li et al. (2007) cannot perform variable selection as QBR does.
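Before turning to further related work, here is a compact, illustrative sketch of QBR with the componentwise simple linear regression base learner used later in Section 4.1.1 and a fixed step size. It is a reading of Fig. 2 under stated assumptions (the function and variable names are this sketch's own), not the author's exact implementation:

```python
import numpy as np

def qbr_fit(X, y, tau, n_steps=200, step=0.1):
    """Illustrative QBR: boost the check loss with one-predictor linear base learners."""
    n, d = X.shape
    coef = np.zeros(d)
    intercept = np.quantile(y, tau)          # start from the unconditional tau-quantile
    for _ in range(n_steps):
        r = y - (intercept + X @ coef)
        # Negative functional gradient of the check loss: tau if r > 0, tau - 1 otherwise.
        U = np.where(r > 0, tau, tau - 1.0)
        # Step 3: fit each candidate simple learner U ~ a + b * X[:, j] by least squares
        # and keep the one with the smallest residual sum of squares.
        best_j, best_ab, best_rss = 0, (0.0, 0.0), np.inf
        for j in range(d):
            xj = X[:, j]
            b = np.cov(xj, U, bias=True)[0, 1] / np.var(xj)
            a = U.mean() - b * xj.mean()
            rss = np.sum((U - a - b * xj) ** 2)
            if rss < best_rss:
                best_j, best_ab, best_rss = j, (a, b), rss
        a, b = best_ab
        # Step 4: move a small fixed step (e.g., 0.1) in the fitted direction.
        intercept += step * a
        coef[best_j] += step * b
    return intercept, coef
```

With the predictors normalized to [−1, 1] as in the paper, the per-predictor updates accumulate into a sparse linear model, which is how QBR performs variable selection.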
In an independent study which applied the functional gradient method, Kriegler and Berk (2007) weighted the absolute loss function according to overestimates and underestimates, resulting in an algorithm similar to QBR. However, the method in Kriegler and Berk (2007) uses a multi-node regression tree as the weak learner and needs to solve multiple optimization problems in each iteration; therefore, it is more time consuming than QBR.

2.2. Quantile boost classification

2.2.1. Predicting quantiles of a binary response

Consider the following model:

$$Y^* = h(x) + \epsilon \quad\text{and}\quad Y = I\{Y^* \ge 0\}, \quad (3)$$

where $Y^*$ is a continuous hidden variable, $h(x)$ is the true model for $Y^*$, $\epsilon$ is a disturbance term, and $Y \in \{0, 1\}$ is the binary class label for the observation x. The model in Eq. (3) was also considered in Koenker (2005), Kordas (2002, 2006), and Manski (1985).

Let $Q_\tau(Y^* \mid x)$ be the $\tau$th conditional quantile of $Y^*$ given x, and let $g(\cdot)$ be a real monotone increasing function. Clearly

$$P(Y^* \ge y \mid x) = P(g(Y^*) \ge g(y) \mid x),$$

so it follows that

$$g\big(Q_\tau(Y^* \mid x)\big) = Q_\tau\big(g(Y^*) \mid x\big). \quad (4)$$

Since the indicator function $I(t \ge 0)$ is monotone increasing with respect to t, Eq. (4) indicates that

$$I\big(Q_\tau(Y^* \mid x) \ge 0\big) = Q_\tau\big(I(Y^* \ge 0) \mid x\big) = Q_\tau(Y \mid x);$$

that is, the conditional quantile function of Y can be obtained by fitting the corresponding conditional quantile function of $Y^*$ (Koenker, 2005; Kordas, 2006).

Let the $\tau$th conditional quantile function of the latent variable $Y^*$ be $f(x, \beta)$ with $\beta$ as the parameter vector, i.e.,

$$Q_\tau(Y^* \mid x) = f(x, \beta). \quad (5)$$

It follows that the conditional quantile function of the binary variable Y can be modeled as

$$Q_\tau(Y \mid x) = I\big(f(x, \beta) \ge 0\big). \quad (6)$$

From the relation $Y = I(Y^* \ge 0)$, we have

$$P(Y = 1 \mid x) = P(Y^* \ge 0 \mid x), \quad (7)$$

while Eq. (5) implies that

$$P\big(Y^* \ge f(x, \beta) \mid x\big) = 1 - \tau. \quad (8)$$

Thus, if $f(x, \beta) = 0$, combining Eqs. (7) and (8) yields

$$P(Y = 1 \mid x) = P\big(Y^* \ge f(x, \beta) \mid x\big) = 1 - \tau.$$

If $f(x, \beta) > 0$,

$$P(Y = 1 \mid x) > P\big(Y^* \ge f(x, \beta) \mid x\big) = 1 - \tau.$$

In summary, we have the following relation:

$$P(Y = 1 \mid x)\;\; \begin{cases} > 1 - \tau, & \text{if } f(x, \beta) > 0,\\ = 1 - \tau, & \text{if } f(x, \beta) = 0,\\ < 1 - \tau, & \text{if } f(x, \beta) < 0, \end{cases} \quad (9)$$

which relates the posterior probability of the binary class label, given the predictor vector, to the sign of $f(x, \beta)$. Hence, if we make decisions at the cut-off posterior probability value $1 - \tau$, we need to fit the $\tau$th quantile regression function of the response. Once the model is fitted, i.e., the parameter vector $\beta$ is estimated as $\hat\beta$, we can make predictions by

$$\hat{Y} = I\big(f(x, \hat\beta) \ge 0\big),$$

where $\hat{Y}$ is the predicted class label for the predictor vector x.

2.2.2. Quantile boost for binary classification

To fit the model for the $\tau$th quantile of Y, i.e., to estimate the parameter vector $\beta$ in Eq. (6), in a way similar to the method applied to QBR, the estimated parameter vector $\hat\beta$ can be obtained by solving

$$\hat\beta = \arg\min_\beta \left\{ L(\beta) = \sum_{i=1}^{n} \rho_\tau\big[\,Y_i - I(f(x_i, \beta) \ge 0)\,\big] \right\}. \quad (10)$$

In Appendix A, it is proved that the above minimization problem is equivalent to a maximization problem:

$$\hat\beta = \arg\max_\beta \left\{ S(\beta) = \sum_{i=1}^{n} \big[\,Y_i - (1 - \tau)\,\big]\, I\big(f(x_i, \beta) \ge 0\big) \right\}. \quad (11)$$

However, the function $S(\beta)$ is not differentiable, because of the use of the indicator function $I(f(x, \beta) \ge 0)$.
To apply gradient-based optimization methods, we replace $I(f(x, \beta) \ge 0)$ by its smoothed version and solve

$$\hat\beta = \arg\max_\beta \left\{ S(\beta; h) = \sum_{i=1}^{n} \big[\,Y_i - (1 - \tau)\,\big]\, K\!\left(\frac{f(x_i, \beta)}{h}\right) \right\}, \quad (12)$$

where h is a small positive value, and $K(t)$ is a smoothed version of the indicator function $I(t \ge 0)$ with the following properties (Kordas, 2006):

$$K(t) > 0 \;\; \forall t \in \mathbb{R}, \qquad \lim_{t \to \infty} K(t) = 1, \qquad \lim_{t \to -\infty} K(t) = 0.$$

In this paper, we take $K(\cdot)$ to be the standard normal cumulative distribution function

$$\Phi(z) = \int_{-\infty}^{z} \phi(t)\,dt \quad\text{with}\quad \phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}. \quad (13)$$

Let each term in the objective function (12) be

$$l(Y, f) = \big[\,Y - (1 - \tau)\,\big]\, K(f/h).$$

Following the general steps of functional gradient ascent, we obtain the Quantile Boost Classification (QBC) algorithm as shown in Fig. 3.

Fig. 3. Quantile boost classification (QBC) algorithm.

2.2.3. Relationship between Eq. (11) and the error rate

Maximizing Eq. (11) has a close relation to the performance of the final classifier $I(f(x, \beta) \ge 0)$. Let each term in Eq. (11) be

$$S_i = \big[\,Y_i - (1 - \tau)\,\big]\, I\big(f(x_i, \beta) \ge 0\big) = \tau\, I(Y_i = 1)\, I\big(f(x_i, \beta) \ge 0\big) - (1 - \tau)\, I(Y_i = 0)\, I\big(f(x_i, \beta) \ge 0\big); \quad (14)$$

then Eq. (11) becomes

$$S(\beta) = \tau \sum_{i=1}^{n} I(Y_i = 1)\, I\big(f(x_i, \beta) \ge 0\big) - (1 - \tau) \sum_{i=1}^{n} I(Y_i = 0)\, I\big(f(x_i, \beta) \ge 0\big) = \tau\,\mathrm{TP} - (1 - \tau)\,\mathrm{FP} = \tau(\mathrm{Pos} - \mathrm{FN}) - (1 - \tau)\,\mathrm{FP} = \tau\,\mathrm{Pos} - \big[\,\tau\,\mathrm{FN} + (1 - \tau)\,\mathrm{FP}\,\big], \quad (15)$$

where TP, FP, and FN are the true positive, false positive, and false negative counts of the classifier $I(f(x, \beta) \ge 0)$, respectively, and Pos is the number of positive training examples.

Eq. (15) implies that maximizing Eq. (11) is equivalent to minimizing a convex combination of FP and FN, namely $\tau\,\mathrm{FN} + (1 - \tau)\,\mathrm{FP}$. If $\tau = 0.5$, this is equivalent to minimizing the number of training errors, whereas if $\tau > 0.5$, the minimization puts more weight on FN, while FP is more heavily emphasized if $\tau < 0.5$. This analysis is consistent with Eq. (9): if $\tau > 0.5$, the classifier makes decisions at the posterior cut-off probability $1 - \tau$, which is less than 0.5, and this will make FN small; similarly, for $\tau < 0.5$, the posterior cut-off probability will be greater than 0.5, which will lower the FP.
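As an illustration of Eqs. (12), (13), and (15), the sketch below evaluates the smoothed objective with the normal-CDF surrogate and its pointwise gradient, assuming for simplicity a linear score $f(x, \beta) = x^\top \beta$ (the linear form and the function names are assumptions of this sketch, not the algorithm of Fig. 3):

```python
import numpy as np
from scipy.stats import norm

def qbc_objective(beta, X, y, tau, h=0.1):
    """Smoothed objective S(beta; h) of Eq. (12) for a linear score f(x, beta) = X @ beta."""
    f_vals = X @ beta
    return np.sum((y - (1.0 - tau)) * norm.cdf(f_vals / h))

def qbc_gradient(beta, X, y, tau, h=0.1):
    """Gradient of S(beta; h) w.r.t. beta, using d Phi(f/h)/d beta = phi(f/h) * x / h."""
    f_vals = X @ beta
    w = (y - (1.0 - tau)) * norm.pdf(f_vals / h) / h
    return X.T @ w

def unsmoothed_S(beta, X, y, tau):
    """Eq. (11) / (15): tau*TP - (1 - tau)*FP for the classifier I(f(x, beta) >= 0)."""
    pred = (X @ beta >= 0).astype(int)
    tp = np.sum((pred == 1) & (y == 1))
    fp = np.sum((pred == 1) & (y == 0))
    return tau * tp - (1.0 - tau) * fp
```

Ascending this gradient in parameter space is the finite-dimensional analogue of the functional gradient ascent that QBC performs with a base learner.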
2.2.4. Related works

Kordas (2002, 2006) proposed binary quantile regression for the purpose of classification using quantiles. However, binary QReg employs the simulated annealing algorithm to perform the optimization task. Although a local optimum is guaranteed, simulated annealing is well known for its expensive computation, and it is usually difficult to tell when the algorithm has converged. QBC, in contrast, is based on gradient ascent; it yields a local maximum and converges quickly. Due to the expensive computation of simulated annealing, binary QReg can only work in very low dimensional spaces. In applications, however, we frequently face hundreds or even thousands of predictors, and it is often desired to identify the informative predictors. Clearly, in this case, binary QReg is not applicable. On the contrary, QBC is designed to work in high dimensional spaces, and it enables us to select the most informative variables by using certain types of base learner.

Hall, Titterington, and Xue (2009) proposed a median-based classifier which works in high dimensional space. For a given predictor vector, Hall et al. (2009) compare the L1 distances from the new predictor vector to the component-wise median vectors of the positive and negative examples in the training set, and assign the label of the class with the smaller L1 distance. Although computationally efficient, this simple nearest-neighbor-like algorithm cannot perform variable selection as the proposed QBC algorithm can. Mease, Wyner, and Buja (2007) combine AdaBoost and sampling techniques to predict the class quantiles, but the sampling technique might cause under-utilization of the available information. The method proposed in Langford et al. (2006) can also be applied to classification tasks; however, it is computationally more expensive than QBC, and it is likewise not clear how to select the most informative predictors.

3. Calculation

In the proposed QBR and QBC algorithms (shown in Figs. 2 and 3, respectively), in a manner similar to the work in Friedman (2001), let the base procedure be $h(x, a)$ with a being a parameter vector. The third step of QBR and QBC can be performed by an ordinary least square regression:

$$a_m = \arg\min_a \sum_{i=1}^{n} \big[\,U_i - h(x_i, a)\,\big]^2;$$

hence the function $g^{[m]}(x) = h(x, a_m)$ may be regarded as the base procedure's approximation of the negative gradient for QBR, while in the QBC algorithm the function $g^{[m]}(x) = h(x, a_m)$ is an approximation of the gradient.

In step 4 of QBR, the step size factor can be determined by a line search algorithm (e.g., Fibonacci search or golden section search (Walsh, 1975, chap. 3)) as

$$\eta_m = \arg\min_\gamma \sum_{i=1}^{n} \rho_\tau\big[\,Y_i - f^{[m-1]}(x_i) - \gamma\, g^{[m]}(x_i)\,\big].$$

For QBC, the line search for the step size is

$$\eta_m = \arg\max_\gamma \sum_{i=1}^{n} \big[\,Y_i - (1 - \tau)\,\big]\, K\!\left(\frac{f^{[m-1]}(x_i) + \gamma\, g^{[m]}(x_i)}{h}\right).$$

Alternatively, for simplicity, in each iteration we can update the fitted function $f^{[m-1]}(\cdot)$ by a fixed, but small, step in the negative gradient direction for QBR and in the gradient direction for QBC. It is verified in Bühlmann and Hothorn (2007) that the size of $\eta_m$ is less important as long as it is sufficiently small. A smaller fixed step size $\eta_m$ typically requires a larger number of boosting iterations and thus more computing time, while the predictive accuracy has been empirically found to be good when $\eta_m$ is chosen "sufficiently small", e.g., $\eta_m = 0.1$ (Friedman, 2001). To balance predictive accuracy and computational burden, Bühlmann and Hothorn (2007) suggested that a step size of 0.1 be chosen. By simulation, Friedman (2001) showed that the performance is similar as long as the step size parameter is less than 0.125. Thus, in our experiments, the step size parameter of QBR and QBC is fixed at $\eta_m = 0.1$ for all m, as suggested by Bühlmann and Hothorn (2007) and Friedman (2001).

In our implementation of the QBC algorithm, the standard normal cumulative distribution function given in Eq. (13) was used as an approximation of the indicator function with h = 0.1. From our experience, QBC is not sensitive to the value of h as long as h < 0.5.

The proposed QBR and QBC algorithms, the median classifier in Hall et al. (2009), and the AdaBoost face detector (Viola & Jones, 2001) were implemented in the programming language MATLAB; all other algorithms, including the original quantile regression, L1 norm quantile regression (Li & Zhu, 2008; Wu & Liu, 2009), AdaBoost (Freund & Schapire, 1997), LogitBoost (Friedman et al., 2000), and L2-Boost (Bühlmann & Yu, 2003), were implemented in the statistical programming language R. All the experiments were conducted on an ordinary PC with a 3.0 GHz CPU and 3.25 GB memory.
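If the line-search variant is preferred over the fixed step $\eta_m = 0.1$ used in the experiments, the one-dimensional searches above can be carried out, for example, by a golden section search (Walsh, 1975). The following generic sketch is illustrative only and is not taken from the paper's code:

```python
import math

def golden_section_minimize(phi, lo=0.0, hi=1.0, tol=1e-4):
    """Minimize a one-dimensional function phi on [lo, hi] by golden section search.
    For the QBC maximization, pass phi = lambda g: -S(g)."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0          # 1 / golden ratio, about 0.618
    a, b = lo, hi
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    fc, fd = phi(c), phi(d)
    while (b - a) > tol:
        if fc < fd:                                 # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - invphi * (b - a)
            fc = phi(c)
        else:                                       # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + invphi * (b - a)
            fd = phi(d)
    return 0.5 * (a + b)
```

For QBR one would pass $\phi(\gamma) = \sum_{i} \rho_\tau\big(Y_i - f^{[m-1]}(x_i) - \gamma\, g^{[m]}(x_i)\big)$ and take the returned value as $\eta_m$.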
4. Results and discussion

We applied the proposed QBR and QBC algorithms to extensive regression and pattern classification problems, and this section presents the experimental results with in-depth discussions.

4.1. Experiments on regression problems

This subsection tests the proposed QBR algorithm on various regression problems with benchmark datasets in Section 4.1.1 and compares QBR to L1 norm quantile regression (Li & Zhu, 2008; Wu & Liu, 2009) on a simulated dataset in Section 4.1.2.

4.1.1. Results of QBR on benchmark datasets

Since the purpose of this paper is to propose an alternative algorithm to the original QReg, we choose only to compare the performance of QBR to that of the original QReg. Five regression datasets from the UCI machine learning repository were used in our experiments: Red Wine Quality, White Wine Quality, Forest Fire, Concrete Slump, and Concrete Compressive; see Table 1 for the basic information about the datasets.

Table 1
Basic information about the five datasets used for regression problems.

Dataset   Red wine   White wine   Forest fire   Concrete slump   Concrete comp.
Size      1599       4898         517           103              1030
Dim.      11         11           12            7                9

The predictor and the response variables were normalized to be in [−1, 1]. To make a fair comparison between QBR and QReg, we used simple linear regression with only one predictor as the base procedure in QBR. In evaluation, we use the sum of the check losses on the testing set as the error measure, defined by

$$L(\tau) = \sum_{i=1}^{N_{\mathrm{test}}} \rho_\tau\big(Y_i - \hat{f}(x_i)\big),$$

where $N_{\mathrm{test}}$ is the size of the testing set, and $\hat{f}(x_i)$ is the predicted $\tau$th quantile at $x_i$ by the original QReg or by QBR. By the definition of quantile, a smaller value of $L(\tau)$ indicates an estimated quantile closer to the true value.

For each dataset, we randomly select 80% of the examples as the training set and use the remaining 20% as the testing set. For three $\tau$ values (0.25, 0.5, and 0.75), QBR and QReg are separately trained and evaluated on these subsets. The partition-training-testing procedure is repeated 500 times. The means and standard deviations of the 500 check losses are reported in Table 2. It is readily observed that in all of the conducted experiments QBR uniformly achieves a smaller average check loss than QReg, which indicates that QBR estimates more accurately.

To statistically compare the results, we conduct a z-test with

$$H_0: L_{\mathrm{QBR}} - L_{\mathrm{QReg}} \ge 0 \quad\text{vs.}\quad H_1: L_{\mathrm{QBR}} - L_{\mathrm{QReg}} < 0, \quad (16)$$

where $L_{\mathrm{QBR}}$ and $L_{\mathrm{QReg}}$ are the check losses of QBR and QReg, respectively. By a standard z-test procedure (please refer to an elementary statistics textbook, e.g., Navidi, 2010, chap. 6), we obtain the p-values for each pair of comparisons, as listed in Table 3. We note that in 11 out of the 15 comparative experiments the p-values are less than 2%, which indicates that in most cases QBR performs significantly better than QReg. The reason might be that in QBR we implicitly regularize the regression coefficients by applying a small learning rate, i.e., by using a small step size $\eta_m$.

In applications, variable selection is often needed, since it helps us obtain a simpler model, making the model easier to interpret and understand and identifying the informative variables. Although the relevant variables are usually unknown, we nonetheless prefer a simpler model which is capable of deleting more variables. Furthermore, a simpler model is more time efficient in making predictions.
Table 2
The comparison between QReg and QBR on the five datasets from the UCI machine learning repository. The values listed are the means of the check loss over the 500 runs; standard deviations are given in parentheses.

              τ = 0.25                    τ = 0.50                    τ = 0.75
Dataset       QReg          QBR           QReg          QBR           QReg          QBR
Red wine      19.71 (1.11)  19.41 (1.11)  24.30 (1.06)  24.07 (1.09)  19.40 (0.81)  19.16 (0.79)
White wine    62.40 (1.84)  62.23 (1.91)  78.69 (1.78)  78.60 (1.92)  62.03 (1.50)  61.78 (1.64)
Forest fire    4.97 (0.54)   4.93 (0.54)  10.04 (0.82)   9.62 (0.78)   9.56 (0.92)   9.14 (0.73)
Conc. slump    0.74 (0.14)   0.71 (0.12)   1.00 (0.17)   0.92 (0.16)   0.95 (0.19)   0.84 (0.17)
Conc. comp.   15.02 (0.77)  14.91 (0.79)  18.21 (1.08)  18.17 (1.00)  13.63 (0.83)  13.52 (0.77)

Table 3
The p-values for the hypothesis testing problem in Eq. (16) on the five datasets for each quantile. 0+ indicates a value less than 0.0001.

           Red wine   White wine   Forest fire   Concrete slump   Concrete comp.
τ = 0.25   0+         0.0769       0.1132        0+               0.0158
τ = 0.50   0+         0.2054       0+            0+               0.2596
τ = 0.75   0+         0.0044       0+            0+               0.0188

Table 4
Comparison between QReg and QBR on variable selection on the Concrete Slump data. The experiment was repeated 500 times. Listed are the average number of selected noisy variables (N.V.) and the mean check loss, with standard deviations in parentheses.

          τ = 0.25                  τ = 0.50                  τ = 0.75
Method    # of N.V.   Error         # of N.V.   Error         # of N.V.   Error
QReg      8.13        0.99 (0.16)   6.97        1.38 (0.21)   9.69        1.28 (0.24)
QBR       1.04        0.97 (0.19)   0.63        1.14 (0.21)   0.98        1.27 (0.30)

In our experiments, both QReg and QBR obtain a linear function of the predictors. Since the predictors and the response are normalized to be in [−1, 1], it makes sense to delete variables whose coefficients are too small, thus performing variable selection. Twenty noisy predictors are added to the Concrete Slump data, each generated uniformly at random from [−1, 1]. Thus, in the noisy dataset, only a small fraction (7 out of 27) of the variables are informative. We then repeat the above experiment. In the linear model obtained, we calculate the sum of the absolute values of all the coefficients; if, for any predictor, the absolute value of its estimated coefficient is less than 1% of this total sum, that predictor is trimmed. We calculate the average number of selected noisy predictors for QReg and QBR over the 500 runs. The means and standard deviations of the check losses on the testing set are also calculated. Table 4 summarizes the results, from which it is evident that for each τ value, compared to the original QReg, QBR selects far fewer noisy predictors while achieving a smaller mean error. This experiment shows that, compared to QReg, QBR is more robust to noisy predictors.

The QBR procedure enables us to use other forms of base learner, for example regression trees (Friedman, 2001; Friedman et al., 2000), and this provides flexibility in certain applications. By contrast, it is not clear how to make use of other forms of regression in the framework of the original QReg.

4.1.2. Comparing to L1 norm quantile regression

QBR belongs to the family of stage-wise additive models, which has a close connection to L1 constrained models (Efron, Hastie, Johnstone, & Tibshirani, 2004). Recently, L1 norm quantile regression (L1-QReg) (Li & Zhu, 2008; Wu & Liu, 2009) was proposed, which imposes an L1 constraint on the coefficient vector in linear quantile regression. Therefore, it is natural to investigate the relationship between QBR and L1-QReg.
Given data $\{(x_i, Y_i),\ i = 1, \ldots, n\}$ and assuming the use of the linear model $\beta_0 + \beta^\top x$ for estimating the $\tau$th quantile of the response, L1-QReg can be formulated as

$$\min_{\beta_0, \beta} \sum_{i=1}^{n} \rho_\tau\big(Y_i - \beta_0 - \beta^\top x_i\big), \quad \text{subject to } |\beta_1| + \cdots + |\beta_p| \le s.$$

The solution of L1-QReg depends on the penalty parameter s, and changing s gives the evolution of the solution, which is called the solution path (Efron et al., 2004; Li & Zhu, 2008). Similarly, in the QBR algorithm with simple linear regression as the base procedure, we can observe the evolution of the regression coefficients with the iteration number.

The relationship between L1-QReg and QBR was investigated by experimentation with simulated data generated from the model

$$Y = \beta^\top x + \epsilon,$$

where $\beta = (3, -1.5, 0, 0, 2, 0, 0, 0)^\top$, the error term $\epsilon$ follows the standard double exponential distribution, and $x \in \mathbb{R}^8$ with each component $x_i \sim N(0, 1)$ and $x_1, \ldots, x_8$ independent. Five hundred training examples and 10,000 testing observations were generated from this model. Define the closeness measure between the estimated model and the true model as $\sum_{i=1}^{8} |\beta_i - \hat\beta_i|$, where $\hat\beta_1, \ldots, \hat\beta_8$ are the estimated regression coefficients.

Fig. 4 shows the solution paths and the check losses of L1-QReg and QBR using simple linear regression as the base procedure. Firstly, we observe that both L1-QReg and QBR select the informative predictors, i.e., $x_1$, $x_2$, and $x_5$, while all other predictors have coefficients very close to zero. The final coefficients estimated by L1-QReg and QBR are listed in Table 5. The two estimates are clearly very similar, with the QBR solution slightly closer to the true model: the closeness measures are 0.3813 for L1-QReg and 0.3184 for QBR, respectively. Secondly, it is observed that the shapes of the solution paths are strikingly similar, which means that each solution of L1-QReg corresponds to a solution of QBR with a certain number of iterations. Thirdly, the average check loss curves show that the trends in the check loss are very similar; the final average check losses are 0.5142 for L1-QReg and 0.5132 for QBR.

Fig. 4. Solution paths and the average check losses on testing data of L1-QReg and QBR. In the plots for L1-QReg, the x-axis is the sum of the absolute values of the solution, i.e., $\sum_{i=1}^{8} |\hat\beta_i|$.

Table 5
The final estimated regression coefficients by L1-QReg and QBR.

Coefficients   β̂1       β̂2        β̂3       β̂4        β̂5       β̂6       β̂7        β̂8
L1-QReg        2.9633   −1.5773   0.0611   −0.0403   1.9287   0.0314   −0.0119   0.0147
QBR            2.9619   −1.5749   0.0516   −0.0283   1.9318   0.0189   −0.0004   0.0142

This comparison demonstrates that L1-QReg and the proposed QBR have similar properties and similar performance in variable selection and prediction accuracy. Despite the similarity when QBR employs simple linear regression as the base learner, as previously mentioned, QBR enjoys the flexibility of being able to use other forms of base learners, while the L1-QReg of Li and Zhu (2008) and Wu and Liu (2009) can only be used for the linear model.
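For readers who wish to reproduce a comparison of this kind, data from the simulation model above can be generated along the following lines. This is an illustrative sketch only: the paper does not specify a random seed, and the standard double exponential distribution is taken here to mean the Laplace distribution with zero location and unit scale.

```python
import numpy as np

def make_simulation_data(n, rng):
    """Y = beta^T x + eps with beta = (3, -1.5, 0, 0, 2, 0, 0, 0) and Laplace(0, 1) noise."""
    beta_true = np.array([3.0, -1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
    X = rng.standard_normal((n, 8))              # x_j ~ N(0, 1), independent components
    eps = rng.laplace(loc=0.0, scale=1.0, size=n)
    return X, X @ beta_true + eps

rng = np.random.default_rng(2012)                # hypothetical seed
X_train, y_train = make_simulation_data(500, rng)
X_test, y_test = make_simulation_data(10000, rng)
closeness = lambda b_hat: np.abs(np.array([3, -1.5, 0, 0, 2, 0, 0, 0]) - b_hat).sum()
```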
4.2. Experiments on binary classification problems

In this subsection, three experiments were conducted to study the performance of the proposed QBC algorithm. In the first two experiments, we made decisions at the cut-off posterior probability 0.5, and therefore we fit QBC with τ = 0.5. Since QBC is a boosting procedure, for comparative purposes we considered, for the most part, only several of the most popular boosting-based classifiers: AdaBoost (Freund & Schapire, 1997), LogitBoost (Friedman et al., 2000), and L2-Boost (Bühlmann & Yu, 2003). We also tested the median classifier (Hall et al., 2009) and compared its performance. The third experiment compared the performance of AdaBoost to that of QBC, with τ = 0.5 and τ = 0.25, on a face detection problem; the role of τ in the QBC algorithm was also illustrated.

4.2.1. Results on credit score data

We compare the result of QBC to that of binary QReg (Kordas, 2002) on the German bank credit score dataset, which is available from the UCI machine learning repository. The dataset is of size 1000 with 300 positive examples, each with 20 predictors normalized to be in [−1, 1]. In Kordas (2002), only 8 predictors were preselected to fit the binary QReg, due to the expensive simulated annealing algorithm it employs in the optimization stage. Without preselecting variables, we randomly selected a training set of size 800 from the dataset and evaluated the learned QBC classifier on the other 200 examples. The QBC training algorithm was run for 100 iterations using simple linear regression with only one predictor as the base learner; it is therefore fair to compare the performance of QBC to that of binary QReg. The splitting-training-testing process was repeated 500 times, and the mean training and testing error rates are reported in Table 6, which also lists the performance of binary QReg from Kordas (2002). It is clear that QBC outperforms binary QReg on both the training and testing sets. This is due to the efficient computation of QBC, which allows the algorithm to explore more predictors and thus select more informative ones.

Table 6
Comparison between binary QReg and QBC on the credit score dataset. "Clean" means the original dataset, and "Noisy" means 20 noisy predictors are added to the dataset.

           QBC                          Binary QReg
Dataset    Training (%)   Testing (%)   Training (%)   Testing (%)
Clean      19.84          24.47         21.9           26.5
Noisy      22.53          25.64         NA             NA

Fig. 5 shows the training error and the average objective function (see Eq. (12)) as the QBC training algorithm proceeds. As we can see, the objective function increases monotonically while the training error decreases. This is consistent with our theoretical analysis in Section 2.2.3. We can also see that the training algorithm converges within about 50 iterations.

Fig. 5. The average objective function and the training error for a particular run of QBC on the credit score dataset. The training error is scaled by 0.5 to make the curves comparable.

To test the variable selection ability of QBC, twenty noisy predictors were generated from the uniform distribution on [−1, 1] and added to the original dataset, and the above procedure was repeated 500 times. Table 6 also lists the average training and testing errors of QBC on the noisy data, with, on average, only one noisy variable selected. Our result demonstrates that QBC performs only slightly worse on noisy data, indicating that QBC is robust to noise. Due to the expensive computation of binary QReg, its performance on noisy data is not provided.
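The comparisons that follow use decision stumps (and, for face detection, regression stumps) as base learners. A minimal regression stump fitted to the working gradient by least squares might look like the sketch below; it is illustrative only, and the function name, threshold grid, and return convention are this sketch's own choices:

```python
import numpy as np

def fit_regression_stump(X, U, n_thresholds=32):
    """Fit U ~ a + b * I(x_j > t) over features j and candidate thresholds t,
    returning the (j, t, a, b) with the smallest residual sum of squares."""
    n, d = X.shape
    best, best_rss = None, np.inf
    for j in range(d):
        for t in np.quantile(X[:, j], np.linspace(0.05, 0.95, n_thresholds)):
            right = X[:, j] > t
            if right.all() or (~right).all():
                continue
            a = U[~right].mean()                     # mean response on the left branch
            b = U[right].mean() - a                  # jump to the right-branch mean
            rss = np.sum((U - (a + b * right)) ** 2)
            if rss < best_rss:
                best, best_rss = (j, t, a, b), rss
    return best  # predict with: a + b * (x[j] > t)
```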
Unlike binary QReg, QBC enjoys the flexibility of using other forms of weak learner. We also compared QBC to the alternatives mentioned at the beginning of Section 4.2 on the credit score dataset with and without noisy predictors. All the boosting-based algorithms used the decision stump (Dettling & Bühlmann, 2003; Viola & Jones, 2001) as the base procedure for a fair comparison, and all algorithms were run 500 times on sets of 800 training and 200 testing examples, all randomly selected. The average testing error rates are listed in Table 7 and show that, compared to the alternative methods, QBC achieves the best performance on both clean and noisy data. Again, we observe that QBC deteriorates only slightly on the noisy data, verifying its robustness to noisy predictors.

Table 7
The testing errors of different algorithms on the credit score dataset.

Data    L2-Boost (%)   AdaBoost (%)   LogitBoost (%)   QBC (%)   Median classifier (%)
Clean   28.68          29.99          28.81            28.50     35.04
Noisy   31.92          30.05          32.68            28.55     38.94

4.2.2. Results on gene expression data

We compare QBC to the alternative methods on three publicly available datasets in bioinformatics (Dettling & Bühlmann, 2003): Estrogen, Nodal, and Colon. In these datasets, each predictor is the measured expression level of a gene, and the response is a binary variable. Although we can access thousands of genes, due to the high cost of the experiments we usually can obtain only a very limited number (50–100) of examples; see Table 8 for the sizes and dimensions of these datasets. The high dimensionality makes binary QReg (Kordas, 2002, 2006) unaffordable. All the boosting-based algorithms used the decision stump as the base learner for a fair comparison.

Since all the datasets are of small size n, leave-one-out (LOO) cross validation is carried out to estimate the classification accuracy. That is, we put aside the ith observation and train the classifier on the remaining (n − 1) data points; we then apply the learned classifier to obtain $\hat{Y}_i$, the predicted class label of the ith observation. This procedure is repeated for all n observations in the dataset, so that each one is held out and predicted exactly once. The LOO error number ($N_{\mathrm{LOO}}$) and rate ($E_{\mathrm{LOO}}$) are determined by

$$N_{\mathrm{LOO}} = \sum_{i=1}^{n} I(\hat{Y}_i \ne Y_i) \quad\text{and}\quad E_{\mathrm{LOO}} = \frac{1}{n}\sum_{i=1}^{n} I(\hat{Y}_i \ne Y_i).$$

Table 8 summarizes the LOO classification error numbers and rates of the considered classifiers on the three datasets, with the best performance marked in bold. From Table 8, it is readily seen that QBC has the best LOO performance. It can also be observed that the median classifier (Hall et al., 2009) is not stable: sometimes its performance is very good, and sometimes it is very poor.

Table 8
The leave-one-out error numbers and rates of the considered algorithms on the three gene expression datasets. For each algorithm, the misclassification numbers are provided and the error rates are listed in parentheses. The best performances are displayed in bold. The size and dimension of each dataset are also given.

Dataset    Size   Dim    L2-Boost      AdaBoost      LogitBoost    QBC           Median classifier
Estrogen   49     7129   4 (8.16%)     5 (10.20%)    6 (12.24%)    4 (8.16%)     6 (12.24%)
Colon      62     2000   10 (16.13%)   11 (17.74%)   10 (16.13%)   9 (14.52%)    9 (14.52%)
Nodal      49     7129   11 (22.45%)   12 (24.49%)   9 (18.37%)    8 (16.33%)    18 (36.73%)

In LOO cross validation, the training and testing sets are highly unbalanced in size, which may affect the evaluation result.
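For completeness, the LOO protocol just described can be sketched generically as follows; this is an illustrative outline assuming user-supplied fit/predict callables, not code from the paper:

```python
import numpy as np

def loo_error(X, y, fit, predict):
    """Leave-one-out error count and rate, i.e., N_LOO and E_LOO as defined above.
    fit(X, y) returns a model; predict(model, x) returns a class label in {0, 1}."""
    n = len(y)
    mistakes = 0
    for i in range(n):
        mask = np.arange(n) != i                 # hold out the i-th observation
        model = fit(X[mask], y[mask])
        mistakes += int(predict(model, X[i]) != y[i])
    return mistakes, mistakes / n
```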
To provide more thorough results, we also conducted 5-fold cross validation (5-CV), in which each dataset was randomly partitioned into five parts of roughly equal size. In every experiment, one part was used as the testing set, and the other four parts were used as the training set. Table 9 lists the mean and standard deviation of the error rates over the five runs of each algorithm on every dataset.

Table 9
The 5-fold cross validation mean error rates and the standard deviations (in parentheses) of the considered algorithms on the three gene expression datasets. The best mean error rates for each dataset are displayed in bold.

Dataset    L2-Boost (%)    AdaBoost (%)    LogitBoost (%)   QBC (%)        Median classifier (%)
Estrogen   21.11 (10.91)   17.11 (10.50)   15.11 (12.87)    13.33 (9.30)   13.38 (11.04)
Colon      24.62 (5.95)    21.41 (7.68)    21.41 (7.68)     21.67 (9.50)   14.58 (9.20)
Nodal      31.78 (17.67)   24.89 (8.52)    20.44 (13.67)    20.00 (9.30)   42.84 (15.73)

We observe from Table 9 that QBC yields the best performance on two out of the three datasets. On the Colon dataset, QBC performs better than L2-Boost, similarly to LogitBoost and AdaBoost, but worse than the median classifier. However, due to the instability of the median classifier, we can still safely conclude that QBC performs better than or similarly to the alternatives.

4.2.3. Face detection

Face detection is a classical problem in computer vision, and many machine learning algorithms have been successfully applied to it, e.g., Support Vector Machines (Osuna, Freund, & Girosi, 1997) and AdaBoost (Viola & Jones, 2001). We applied the proposed QBC algorithm to a dataset with 3000 face and 3000 non-face images, each of size 16 × 16. The training set includes 1500 face and 1500 non-face images randomly selected from the whole dataset, and the rest of the images were used for testing. Emulating Viola and Jones (2001), we extracted different types of Haar features from the images, taking a minimum rectangle size of 4 by 4 (for example, 4 by 5, 8 by 4, 10 by 11), resulting in a total of around 6000 features. The AdaBoost face detector follows Viola and Jones (2001), and QBC employs the regression stump (Torralba, Murphy, & Freeman, 2004) as the weak learner, since it is similar to the weak learner used in Viola and Jones (2001).

We first compare the classification error rate of AdaBoost to that of QBC with τ = 0.5 and show the testing error curves in Fig. 6.

Fig. 6. On the face detection problem, the testing error rates of AdaBoost, QBC with τ = 0.5, and QBC with τ = 0.25.

It is observed that AdaBoost slightly outperforms QBC with τ = 0.5. To take a closer look, we plot the false positive rates and false negative rates of the two methods in Fig. 7. It is clear that after 40 iterations, AdaBoost and QBC (with τ = 0.5) have similar false positive rates, but AdaBoost has a smaller false negative rate, which is why AdaBoost has a slightly better total error rate. We further notice in Fig. 7 that the false positive rates are much higher than the false negative rates.

Therefore, on this dataset, in order to improve the classification accuracy of QBC, we have to decrease the false positive rate. According to the theoretical analysis in Section 2.2.3, if τ < 0.5, the algorithm will pay more attention to decreasing the false positive rate. As such, we fit another QBC classifier with τ = 0.25, and Fig. 7 also shows the corresponding false positive and false negative rates. It is observed that although the false negative rate of QBC with τ = 0.25 is slightly higher than those of QBC with τ = 0.5 and AdaBoost, its false positive rate is significantly lower than those of QBC with τ = 0.5 and AdaBoost.
As a consequence, the overall testing error rate drops significantly, as shown in Fig. 6. We further notice that the performance curves of AdaBoost are much more erratic than those of QBC, which implies that the performance of AdaBoost is less stable.

There are modifications of AdaBoost in the literature for balancing FP and FN, e.g., Masnadi-Shirazi and Vasconcelos (2007) and Sun, Kamel, Wong, and Wang (2007). This experiment was intended to demonstrate the role of τ in the QBC algorithm, so we chose not to compare QBC with different τ values to the various versions of AdaBoost for imbalanced penalties.

As a comparison to AdaBoost and QBC, the median-based classifier reported in Hall et al. (2009) was also tested on the face detection problem. The test was repeated 20 times, since this algorithm is time efficient. In each repetition, we randomly split the whole dataset so that both the training set and the testing set include 1500 face and 1500 non-face images. The mean error rate of the 20 tests is 14.42%, the mean false positive rate is 21.78%, and the mean false negative rate is 7.07%. Comparing these numbers to Figs. 6 and 7, we see that the performance of the median classifier in Hall et al. (2009) is much worse than those of AdaBoost and QBC.

Fig. 7. On the face testing set, false positive and false negative rates of AdaBoost, QBC with τ = 0.5, and QBC with τ = 0.25. Note that the y-axes have different scales in the two figures.

5. Conclusions

Motivated by the gradient boosting framework, this paper applies functional gradient descent to fit quantile regression (QReg), resulting in the Quantile Boost Regression (QBR) algorithm. In binary classification problems, the class label is defined via a hidden variable, and predicting the quantiles of the binary response is reduced to predicting the corresponding quantiles of the hidden variable. An equivalent form of the definition of quantile was introduced, and its smoothed version was further employed as the objective function. Functional gradient ascent was applied to maximize the objective function for fitting the classifier, yielding the Quantile Boost Classification (QBC) algorithm. The proposed Quantile Boost (QBoost) algorithms yield local optima and, more importantly, enable us to solve problems in high dimensional spaces and to select informative variables/features.

The QBoost algorithms were tested extensively on various regression and binary classification problems with benchmark datasets. On most of the regression experiments, QBR performs significantly better than the original QReg in terms of the check loss function. Moreover, the comparative experiment on noisy data indicates that QBR is more robust to noise. The comparative result on a credit score dataset shows that QBC outperforms binary QReg (Kordas, 2002) and is more robust to noisy predictors. The experiments on the credit score dataset and on gene expression data demonstrate that QBC performs better than or similarly to the other alternatives in terms of leave-one-out and five-fold cross validation error rates. QBC was compared to the popular AdaBoost classifier (Viola & Jones, 2001) on a face detection problem, and we demonstrated the role of the desired τ value in the QBC algorithm. On face detection and gene expression data analysis, binary QReg (Kordas, 2002) is not applicable due to the high dimensionality of the datasets.
The current version of QBC is for two-class problems, and we plan to develop a multi-class version of QBC by reducing it to a series of two-class tasks, for example by one-versus-all (Freund & Schapire, 1997) or Error-Correcting Output Codes (Dietterich & Bakiri, 1995). A more thorough comparison of the proposed QBR/QBC to other quantile-based regression/classification algorithms on more datasets is also desired.

Acknowledgments

The author gratefully acknowledges the distinction and support provided by the Sociedad Mexicana de Inteligencia Artificial (SMIA) and the 9th Mexican International Conference on Artificial Intelligence (MICAI-2010) in order to enhance, improve, and publish this work. The author also thanks the reviewers of MICAI and of the journal Expert Systems with Applications for their constructive comments, and would like to extend his gratitude to Prof. Alejandro Peña-Ayala for his excellent work in coordinating the preparation, reviewing, and revising of the manuscript. Prof. Shelby J. Kilmer provided a great deal of help with the English in this manuscript. This work was partially supported by a 2011 Summer Faculty Fellowship of Missouri State University.

Appendix A. The equivalence between Eqs. (10) and (11)

In Eq. (10), let

$$L_i = \rho_\tau\big(Y_i - I(f(x_i, \beta) \ge 0)\big) = \begin{cases} \rho_\tau(Y_i - 1), & \text{if } f(x_i, \beta) \ge 0,\\ \rho_\tau(Y_i), & \text{if } f(x_i, \beta) < 0, \end{cases}$$

while

$$\rho_\tau(Y_i) = \tau\, I(Y_i = 1) \quad\text{and}\quad \rho_\tau(Y_i - 1) = (1 - \tau)\, I(Y_i = 0).$$

Thus,

$$\begin{aligned} L_i &= (1 - \tau)\, I(Y_i = 0)\, I\big(f(x_i, \beta) \ge 0\big) + \tau\, I(Y_i = 1)\, I\big(f(x_i, \beta) < 0\big) \\ &= (1 - \tau)\, I(Y_i = 0)\, I\big(f(x_i, \beta) \ge 0\big) + \tau\, I(Y_i = 1)\,\big[1 - I\big(f(x_i, \beta) \ge 0\big)\big] \\ &= -\tau\, I(Y_i = 1)\, I\big(f(x_i, \beta) \ge 0\big) + (1 - \tau)\, I(Y_i = 0)\, I\big(f(x_i, \beta) \ge 0\big) + \tau\, I(Y_i = 1) \\ &= -S_i + \tau\, I(Y_i = 1), \end{aligned}$$

where $S_i$ is defined in Eq. (14). Therefore, Eq. (10) reads

$$L(\beta) = \sum_{i=1}^{n} L_i = \sum_{i=1}^{n} \big[-S_i + \tau\, I(Y_i = 1)\big] = -S(\beta) + \tau \sum_{i=1}^{n} I(Y_i = 1) = -S(\beta) + \tau\,\mathrm{Pos}, \quad (\mathrm{A.1})$$

where Pos is the number of positive examples, which is a constant. Eq. (A.1) shows the equivalence between Eqs. (10) and (11).

References

Bühlmann, P., & Hothorn, T. (2007). Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, 22, 477–505.
Bühlmann, P., & Yu, B. (2003). Boosting with the L2-loss: Regression and classification. Journal of the American Statistical Association, 98, 324–340.
Cade, B. S., & Noon, B. R. (2003). A gentle introduction to quantile regression for ecologists. Frontiers in Ecology and the Environment, 1(8), 412–420.
Dettling, M., & Bühlmann, P. (2003). Boosting for tumor classification with gene expression data. Bioinformatics, 19, 1061–1069.
Dietterich, T. G., & Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2, 263–286.
Duffy, N., & Helmbold, D. (2002). Boosting methods for regression. Machine Learning, 47, 153–200.
Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32, 407–499.
Freund, Y., & Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119–139.
Friedman, J. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29, 1189–1232.
Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28, 337–407.
Hall, P., Titterington, D. M., & Xue, J. H. (2009). Median-based classifiers for high-dimensional data. Journal of the American Statistical Association, 104, 1597–1608.
Hendricks, W., & Koenker, R. (1992). Hierarchical spline models for conditional quantiles and the demand for electricity. Journal of the American Statistical Association, 93, 58–68.
Hunter, D. R., & Lange, K. (2000). Quantile regression via an MM algorithm. Journal of Computational and Graphical Statistics, 9, 60–77.
Koenker, R. (2005). Quantile regression (1st ed.). New York: Cambridge University Press.
Koenker, R., & Bassett, G. (1978). Regression quantiles. Econometrica, 46, 33–50.
Koenker, R., & Geling, O. (2001). Reappraising medfly longevity: A quantile regression survival analysis. Journal of the American Statistical Association, 96, 458–468.
Koenker, R., & Hallock, K. (2001). Quantile regression. Journal of Economic Perspectives, 15, 143–156.
Kordas, G. (2002). Credit scoring using binary quantile regression. In Y. Dodge (Ed.), Statistical data analysis based on the L1-norm and related methods.
Kordas, G. (2006). Smoothed binary regression quantiles. Journal of Applied Econometrics, 21, 387–407.
Kriegler, B., & Berk, R. (2007). Boosting the quantile distribution: A cost-sensitive statistical learning procedure. Technical Report, Department of Statistics, University of California, Los Angeles.
Langford, J., Oliveira, R., & Zadrozny, B. (2006). Predicting conditional quantiles via reduction to classification. In Proceedings of uncertainty in artificial intelligence.
Li, Y., & Zhu, J. (2008). L1-norm quantile regression. Journal of Computational and Graphical Statistics, 17, 163–185.
Li, Y., Liu, Y., & Zhu, J. (2007). Quantile regression in reproducing kernel Hilbert spaces. Journal of the American Statistical Association, 102, 255–268.
Manski, C. F. (1985). Semiparametric analysis of discrete response: Asymptotic properties of the maximum score estimator. Journal of Econometrics, 27, 313–333.
Masnadi-Shirazi, H., & Vasconcelos, N. (2007). Asymmetric boosting. In Proceedings of the international conference on machine learning (ICML).
Mason, L., Baxter, J., Bartlett, P. L., & Frean, M. (2000). Boosting algorithms as gradient descent. In Proceedings of advances in neural information processing systems (NIPS).
Mease, D., Wyner, A. J., & Buja, A. (2007). Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research, 8, 409–439.
Navidi, W. (2010). Statistics for engineers and scientists (3rd ed.). New York: McGraw-Hill.
Osuna, E., Freund, R., & Girosi, F. (1997). Training support vector machines: An application to face detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Sun, Y., Kamel, M., Wong, A., & Wang, Y. (2007). Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 40, 3358–3378.
Takeuchi, I., Le, Q. V., Sears, T. D., & Smola, A. J. (2006). Nonparametric quantile estimation. Journal of Machine Learning Research, 7, 1231–1264.
Torralba, A., Murphy, K. P., & Freeman, W. T. (2004). Sharing features: Efficient boosting procedures for multiclass object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Walsh, G. R. (1975). Methods of optimization. London: John Wiley and Sons.
Wu, Y., & Liu, Y. (2009). Variable selection in quantile regression. Statistica Sinica, 19, 801–817.
Zemel, R., & Pitassi, T. (2001). A gradient-based boosting algorithm for regression problems. In Proceedings of advances in neural information processing systems (NIPS).