Re-sampling Strategies for Regression

Luís Torgo [1,2], Paula Branco [1,2], Rita P. Ribeiro [1,2], and Bernhard Pfahringer [3]
[1] LIAAD - INESC TEC
[2] DCC - Faculdade de Ciências - Universidade do Porto
[3] Department of Computer Science - University of Waikato
ltorgo@dcc.fc.up.pt, paobranco@gmail.com, rpribeiro@dcc.fc.up.pt, bernhard@cs.waikato.ac.nz

February 22, 2016

Abstract

Several real world prediction problems involve forecasting rare values of a target variable. When this variable is nominal we have a problem of class imbalance which was thoroughly studied within machine learning. For regression tasks, where the target variable is continuous, few works exist addressing this type of problem. Still, important applications involve forecasting rare extreme values of a continuous target variable. This paper describes a contribution to this type of tasks. Namely, we propose to address such tasks by re-sampling approaches that change the distribution of the given data set to decrease the problem of imbalance between the rare target cases and the most frequent ones. We present two modifications of well-known re-sampling strategies for classification tasks: the under-sampling and the Smote methods. These modifications allow the use of these strategies on regression tasks where the goal is to forecast rare extreme values of the target variable. In an extensive set of experiments we provide empirical evidence for the superiority of our proposals for these particular regression tasks. The proposed re-sampling methods can be used with any existing regression algorithm, which means that they are general tools for addressing problems of forecasting rare extreme values of a continuous target variable.

1 Introduction

Forecasting rare extreme values of a continuous variable is very relevant for several real world domains (e.g. finance, ecology, meteorology, etc.). This problem can be seen as equivalent to classification problems with imbalanced class distributions, which have been studied for a long time within machine learning (e.g. Domingos (1999); Elkan (2001); Zadrozny (2005); Chawla (2005)). The main difference is the fact that we have a numeric target variable, i.e. a regression task. This type of problem is particularly difficult because: i) there are few examples with the rare target values; ii) the errors of the learned models are not equally relevant because the user's main goal is predictive accuracy on the rare values; and iii) standard prediction error metrics are not adequate to measure the quality of the models given the preference bias of the user.

The existing approaches for the classification scenario can be cast into 3 main groups (Zadrozny, 2003; Ling and Sheng, 2010): i) change the evaluation metrics to better capture the application bias; ii) change the learning systems to bias their optimization process towards the goals of these domains; and iii) re-sampling approaches that manipulate the training data distribution so as to allow the use of standard learning systems. All these three approaches were extensively explored within the classification scenario (e.g. Kubat and Matwin (1997); Chawla et al. (2002)). Research work within the regression setting is much more limited. Torgo and Ribeiro (2009) and Ribeiro (2011) proposed a set of specific metrics for regression tasks with non-uniform costs and benefits. Ribeiro (2011) described the system ubaRules, which was specifically designed to address this type of problem.
Still, to the best of our knowledge, no one has tried re-sampling approaches on this type of regression tasks. Nevertheless, re-sampling strategies have a clear advantage over the other alternatives - they allow the use of any existing regression tool on this type of tasks without the need to change it. The main goal of this paper is to explore this alternative within a regression context. We describe two possible methods: i) using an under-sampling strategy; and ii) using a Smote-like approach that performs both over- and under-sampling.

The main contributions of this work are: i) presenting a first attempt at addressing rare extreme values prediction using standard regression tools through re-sampling approaches; and ii) adapting two well-known and successful re-sampling methods to regression tasks. The results of the empirical evaluation of our contributions provide clear evidence of the validity of these approaches for the task of predicting rare extreme values of a numeric target variable. The significance of our contributions results from the fact that they allow the use of any existing regression tool on these important tasks by simply manipulating the available data set using our supplied code. All code and data used in this paper are provided on an associated web page [1] to ensure easy reproducibility and re-use of our results and algorithms.

[1] http://www.dcc.fc.up.pt/~ltorgo/ExpertSystems

2 Problem Formulation

Predicting rare extreme values of a continuous variable is a particular class of regression problems. In this context, given a training sample of the problem, D = {⟨x_i, y_i⟩}^N_{i=1}, our goal is to obtain a model that approximates the unknown regression function y = f(x). The particularity of our target tasks is that the goal is the predictive accuracy on a particular subset of the domain of the target variable Y - the rare and extreme values. As mentioned before, this is similar to classification problems with extremely unbalanced classes. As in those problems, the user goal is the performance of the models on a sub-range of the target variable values that is very infrequent. In this context, standard regression metrics (e.g. mean squared error) suffer from the same problems as error rate (or accuracy) on imbalanced classification tasks - they do not focus on the performance on the rare cases.

In classification, the solution usually revolves around the use of the precision/recall evaluation framework (Davis and Goadrich, 2006). Precision provides an indication of how accurate the predictions of rare cases made by the model are. Recall tells us how frequently the rare situations were signalled as such by the model. Both are important properties that frequently require some form of trade-off. How can we get a similar evaluation for the numeric prediction of rare extreme values? On the one hand we want our models to be accurate when they predict an extreme value (high precision); on the other hand we want our models to make extreme value predictions for the cases where the true value is an extreme (high recall). Assuming the user gives us information on what is considered an extreme for the domain at hand (e.g. Y < k1 is an extreme low, and Y > k2 is an extreme high), we could transform this into a classification problem and calculate the precision and recall of our models for each type of extreme. However, this would ignore the notion of numeric precision.
Two predicted values that are very distant from each other would count as equally valuable predictions as long as both are extremes (above or below the given thresholds). This is clearly counter-intuitive in regression problems such as our tasks. A solution to this problem was described by Torgo and Ribeiro (2009) and Ribeiro (2011), who presented a formulation of precision and recall for regression tasks that also considers the issue of numeric accuracy. We will use this framework to compare and evaluate our proposals for this type of tasks. For completeness, we will now briefly describe the framework proposed by Ribeiro (2011) that will be used in the experimental evaluation of our proposal [2].

[2] The code implementing this evaluation framework that was used in our experiments is available at http://www.dcc.fc.up.pt/~rpribeiro/uba/.

2.1 Utility-based Regression

The precision/recall evaluation framework we will use is based on the concept of utility-based regression (Ribeiro, 2011; Torgo and Ribeiro, 2007). At the core of utility-based regression is the notion of relevance of the target variable values and the assumption that this relevance is not uniform across the domain of this variable. This notion is motivated by the fact that, contrary to standard regression, in some domains not all the values are equally important/relevant. In utility-based regression the usefulness of a prediction is a function of both the numeric error of the prediction (given by some loss function L(ŷ, y)) and the relevance (importance) of both the predicted ŷ and true y values. Relevance is the crucial property that expresses the domain-specific biases concerning the different importance of the values. It is defined as a continuous function φ(Y) : Y → [0, 1] that maps the target variable domain Y into a [0, 1] scale of relevance, where 0 represents the minimum and 1 represents the maximum relevance.

To better illustrate the concept of the relevance function, let us consider a regression problem where the goal is to predict the total percentage of burnt area on forest fires. Figure 1 shows a possible relevance function for this problem. It was created with the goal of stressing the domain-specific knowledge that large scale forest fires are the most important events we should aim at forecasting [3]. In effect, the accurate prediction of this type of fires has high benefits as it allows prevention and planning actions to be carried out and also avoids the large costs of not being prepared for these events.

Figure 1: Relevance function φ for the percentage of burnt forest area (φ(Y) on the vertical axis, from 0 to 1; Y on the horizontal axis, from 0% to 46%).

[3] The function φ was obtained by applying piecewise cubic Hermite interpolation to the box-plot (Cleveland, 1993) statistics of the target variable as described in Ribeiro (2011). Still, any other method could have been used to define the relevance function.

According to Ribeiro (2011), the only assumption regarding the shape of the function φ() is that there are some ranges of its domain where it assumes a quasi-concave shape, designated as bumps of relevance (see Ribeiro (2011) for more details). These bumps correspond to the ranges of the target variable values the user deems as more important. Being a domain-specific function, it is the user's responsibility to specify the relevance function.
However, Ribeiro (2011) describes some specific methods for obtaining these functions automatically when the goal is to be accurate at rare extreme values, which is the case of our applications. The methods are based on the simple observation that for these applications the notion of relevance is inversely proportional to the target variable probability density function. In our experiments we have used this strategy to come up with the relevance functions for the data sets we have used. The obtained relevance functions have essentially one [4] or two bumps of relevance associated with the high and low extreme values, and then a large region in the middle with near zero relevance, corresponding to the non-extreme and frequent values of the target variable.

[4] Some data sets only have one bump because they either only have high or only have low rare extreme values.

The utility of a model prediction is related to the question of whether it has led to the identification of the correct type of extreme and whether the prediction was precise enough in numeric terms. Thus, to calculate the utility of a prediction it is necessary to consider two aspects: (i) does it identify the correct type of extreme? (ii) what is the numeric accuracy of the prediction (i.e. L(ŷ, y))? This latter issue is important because it allows for coping with different "degrees" of actions as a result of the model predictions. For instance, in the context of financial trading an agent may use a decision rule that implies buying an asset if the predicted return is above a certain threshold. However, this same agent may invest different amounts depending on the predicted return, hence the need for precise numeric forecasts of the returns on top of the correct identification of the type of extreme. This numeric precision, together with the fact that we may have more than one type of extreme (i.e. more than one "positive" class), are the key distinguishing features of this framework when compared to pure classification approaches, and are also the main reasons why it does not make sense to map our problems to classification tasks.

The concrete utility score of a prediction, in accordance with the original framework of utility-based learning (e.g. Elkan (2001); Zadrozny (2005)), results from the net balance between its benefits and costs (i.e. negative benefits). A prediction should be considered beneficial only if it leads to the identification of the correct type of extreme. However, the reward should also increase with the numeric accuracy of the prediction and should be dependent on the relevance of the true value. In this context, Ribeiro (2011) has defined the benefit of a prediction as a proportion of its maximum benefit, which is given by the relevance of the true value, i.e. φ(y),

    B_φ(ŷ, y) = φ(y) · (1 − Γ_B(ŷ, y))    (1)

where Γ_B is a bounded loss function for benefits.

The bounded loss Γ_B is a [0, 1] function that maps the unbounded range of a standard regression loss function (e.g. squared error) into a proportion [5]. Γ_B is designed to give the value of 0 if we have a perfect prediction, thus leading to the maximum benefits in the current situation, i.e. φ(y). The function smoothly increases up to 1 as we move away from a perfect prediction. The definition of Γ_B ensures that the benefits will be zero if either the prediction is too inaccurate or the predicted value is not the same type of extreme as the true value.

[5] See full details in Section 3.3 of Ribeiro (2011).
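To make these components concrete, the following minimal sketch (in R, the environment used to implement our methods) computes the benefit of Equation (1) for vectors of predictions. The relevance function phi() is assumed to be supplied by the user, and the capped absolute error with tolerance delta is only an illustrative stand-in for the bounded loss Γ_B of Ribeiro (2011); in particular, it does not by itself encode the requirement that the prediction be the same type of extreme as the true value.

    # Sketch of Equation (1): benefit of a set of predictions.
    # 'phi' is a vectorised relevance function; the capped absolute error used
    # here is only an illustrative bounded loss, not the actual Gamma_B.
    benefit <- function(y.pred, y.true, phi, delta) {
      gammaB <- pmin(abs(y.pred - y.true) / delta, 1)   # bounded loss in [0, 1]
      phi(y.true) * (1 - gammaB)                        # Equation (1)
    }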
Regarding costs, the rationale is that a prediction should entail a cost if it is too inaccurate and/or is unable to correctly identify the type of the true value [6]. To quantify the cost of these erroneous predictions, it is necessary to determine how relevant the impact of such mistakes is. While the benefits of a prediction depend on the usefulness of its associated true value (i.e. φ(y)), costs depend on the relevance of both the true and predicted values, because they are of different types. This means that costs are proportional to the relevance of both the true and predicted values. The joint relevance function captures this notion by calculating a weighted average of these two factors,

    φ^p(ŷ, y) = (1 − p) · φ(ŷ) + p · φ(y)    (2)

where p ∈ [0, 1] is a factor differentiating the types of errors.

[6] Recall that we essentially have 3 types of values - extremes (high or low) or irrelevant values.

In the context of our target applications we can distinguish three different types of mistakes: (i) false alarms, where we predict an extreme value for an irrelevant/normal true value; (ii) missed events, where the model predicts an irrelevant value but the true value is either an extreme high or low value; and (iii) confusing events, where we predict a high (low) extreme for a true low (high) extreme value. This third scenario is the most serious type of mistake and also the one where the value of φ^p(ŷ, y) will be higher, because both φ(ŷ) and φ(y) will be high.

Ribeiro (2011) defined the cost of a prediction as a proportion of the maximum cost, which is given by the joint relevance of both true and predicted values. Hence, the cost function C^p_φ is defined as,

    C^p_φ(ŷ, y) = φ^p(ŷ, y) · Γ_C(ŷ, y)    (3)

where φ^p is the joint relevance function and Γ_C is the bounded loss function for costs.

Similarly to Γ_B, the bounded loss Γ_C is a [0, 1] function that determines the proportion of the maximum cost assigned to a prediction, based on the prediction error measured by a standard loss function. It should be 0 for a perfect prediction and increase up to 1 as the prediction moves away from a perfect one.

Given the above definitions of benefits and costs, the utility associated with any prediction can be calculated as the net balance of these two factors,

    U^p_φ(ŷ, y) = B_φ(ŷ, y) − C^p_φ(ŷ, y)
                = φ(y) · (1 − Γ_B(ŷ, y)) − φ^p(ŷ, y) · Γ_C(ŷ, y)    (4)

where B_φ(ŷ, y), C^p_φ(ŷ, y), Γ_B(ŷ, y) and Γ_C(ŷ, y) are the functions related to the notions of benefits and costs of predictions defined above and in Ribeiro (2011).
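Under the same illustrative assumptions as before (a vectorised relevance function phi() and a capped absolute error standing in for both Γ_B and Γ_C, which in the actual framework are the bounded losses of Ribeiro (2011)), Equations (2)-(4) can be sketched in R as follows; this is a simplified stand-in, not the exact utility of the framework.

    # Sketch of Equations (2)-(4): utility as the net balance of benefits and costs.
    utility <- function(y.pred, y.true, phi, delta, p = 0.5) {
      gamma <- pmin(abs(y.pred - y.true) / delta, 1)    # stand-in for Gamma_B and Gamma_C
      phi.j <- (1 - p) * phi(y.pred) + p * phi(y.true)  # Equation (2): joint relevance
      benef <- phi(y.true) * (1 - gamma)                # Equation (1): benefit
      cost  <- phi.j * gamma                            # Equation (3): cost
      benef - cost                                      # Equation (4): utility
    }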
2.2 Precision and Recall for Regression

Precision and recall are two of the most commonly used metrics to estimate the performance of models in highly skewed domains (Davis and Goadrich, 2006) such as our target domains. The main advantage of these statistics is that they are focused on the performance on the target events, disregarding the remaining cases. In imbalanced classification problems, the target events are the cases belonging to the minority (positive) class. Informally, precision measures the proportion of events signalled by the model that are real events, while recall measures the proportion of events occurring in the domain that are captured by the model. The notions of precision and recall were adapted to regression problems with non-uniform relevance of the target values by Torgo and Ribeiro (2009) and Ribeiro (2011). In this paper we will use the framework proposed by these authors to evaluate and compare our sampling approaches. We will now briefly present the main details of this formulation [7].

[7] Full details can be obtained in Chapter 4 of Ribeiro (2011).

The key issue of the precision/recall framework is the notion of interesting events. The definition of the cases which are to be considered interesting for the user is not as straightforward in regression as in classification. Ribeiro (2011) suggests the use of the relevance function to define the notion of interesting events for regression problems. Namely, a case is considered an interesting event if,

    z := I(φ(y) ≥ t_E)    (5)

where I is the indicator function giving 1 if its argument is true and 0 otherwise, and t_E ∈ [0, 1] is a domain-specific event threshold on the relevance scores.

In order to calculate precision and recall it is also necessary to establish when a prediction is accurate concerning a value being or not being an event. Intuitively, one could be tempted to consider a prediction as a correct forecast of the event if φ(ŷ) ≥ t_E. However, as mentioned by Ribeiro (2011), there might be cases where the relevance of the predicted value is high and nevertheless the prediction is incorrect. For instance, if the true target value is an extreme high value and the predicted value is an extreme low value, both will have a high relevance score and still this is a serious error. The origin of the problem lies in the fact that, contrary to the classification set-up, in utility-based regression there may exist more than one interesting "class", corresponding to more than one bump of relevance. This is a fundamental difference to the imbalanced classification framework.

The solution proposed by Ribeiro (2011) is to resort to the notion of utility of a prediction to decide its correctness. From the definition of utility (Equation 4) it follows that: (i) the maximum of U^p_φ(ŷ, y) is φ(y); and (ii) the minimum of U^p_φ(ŷ, y) is −φ^p(ŷ, y). In this sense, the higher the precision of the predictions, the closer U^p_φ is to φ(y). This means that the utility score is strongly correlated with the probability that a prediction ŷ is in the same bump as y, i.e. that it is a correct forecast of some event. Ribeiro (2011) proposes a method based on the calibration of the raw utility scores to reach the final decision on whether a predicted value is or is not a correct "event" prediction, i.e. the value of ẑ. Namely, this binary property is defined as,

    ẑ := I(s > (1 − u)/2)    (6)

where s is a calibrated utility score, u is the raw utility score (U^p_φ(ŷ, y)) and I is the indicator function giving 1 if its argument is true and 0 otherwise.

Precision and recall are usually defined as ratios between the correctly identified events (usually known as true positives within classification), and either the signalled events (for precision), or the true events (for recall). Given that the notion of event was defined based on the concept of utility, Ribeiro (2011) also defines the two denominators of these ratios as functions of utility, finally leading to the following definitions of precision and recall for regression,

    recall = [ Σ_{i: ẑ_i=1, z_i=1} (1 + u_i) ] / [ Σ_{i: z_i=1} (1 + φ(y_i)) ]    (7)

and

    precision = [ Σ_{i: ẑ_i=1, z_i=1} (1 + u_i) ] / [ Σ_{i: ẑ_i=1, z_i=1} (1 + φ(y_i)) + Σ_{i: ẑ_i=1, z_i=0} (2 − p · (1 − φ(y_i))) ]    (8)

where p is a weight differentiating the types of errors, while ẑ and z are the binary properties associated with being in the presence of a rare extreme case.
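A minimal sketch of Equations (7) and (8) in R, assuming as given the per-case vectors u (raw utility scores U^p_φ) and phiY (relevance φ(y) of the true values), together with logical vectors z and zhat corresponding to Equations (5) and (6); these names and helpers are ours and purely illustrative.

    # Utility-based recall and precision (Equations (7) and (8)).
    util.recall <- function(u, phiY, z, zhat) {
      sum(1 + u[zhat & z]) / sum(1 + phiY[z])
    }
    util.precision <- function(u, phiY, z, zhat, p = 0.5) {
      sum(1 + u[zhat & z]) /
        (sum(1 + phiY[zhat & z]) + sum(2 - p * (1 - phiY[zhat & !z])))
    }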
Through the use of utility in our proposed definitions of precision and recall we are able to penalize not only false alarms (FP) and missed opportunities (FN), but also confusing events. In an application where both types of extremes, high and low, are of interest to the end user, a confusing event should be heavily penalized. In the stock market domain, for instance, if a model advises the user to buy when in fact it should advise selling, that is a serious error. By using utility, that same prediction would be accounted as a cost (i.e. a negative utility value). Moreover, the amount of penalization is more sensitive, in the sense that it takes values on a continuous scale depending on the difference between the true and predicted values, because of the definition of utility.

In the experimental evaluation of our sampling approaches we have used as the main evaluation metric the F-measure, which can be calculated from the values of precision and recall,

    F = ((β² + 1) · precision · recall) / (β² · precision + recall)    (9)

where β is a parameter weighing the importance given to precision and recall (we have used β = 1, which means equal importance is given to both factors).

3 Re-sampling Approaches for Regression

The basic motivation for re-sampling approaches is the assumption that the imbalanced distribution of the given training sample will bias the learning systems towards solutions that are not in accordance with the user's preference goal. This occurs because the goal is predictive accuracy on the data that is least represented in the original data set. Most existing learning systems work by searching the space of possible models with the goal of optimizing some criteria. These criteria are usually related to some form of average performance. Such metrics will tend to reflect the performance on the most frequent cases, which are not the goal of the user. In this context, the goal of re-sampling approaches is to change the data distribution on the training sample so as to make the learners focus on the cases that are of interest to the user. The change that is carried out has the goal of balancing the distribution of the least represented (but more important) cases with the more frequent observations.

Many re-sampling approaches exist within the imbalanced classification literature. To the best of our knowledge no attempt has been made to apply these strategies to the equivalent regression tasks - forecasting rare extreme values. In this section we describe the adaptation of two existing re-sampling approaches to these regression tasks.

3.1 Under-sampling common values

The basic idea of under-sampling (e.g. Kubat and Matwin (1997)) is to decrease the number of observations with the most common target variable values, with the goal of better balancing the ratio between these observations and the ones with the interesting target values that are less frequent. Within classification this consists of obtaining a random sample from the training cases with the frequent (and less interesting) class values. This sample is then joined with the observations with the rare target class value to form the final training set that is used by the selected learning algorithm. This means that the training sample resulting from this approach will be smaller than the original (imbalanced) data set.

In regression we have a continuous target variable. As mentioned in Section 2.1, the notion of relevance can be used to specify the values of a continuous target variable that are more important for the user.
We can also use the relevance function values to determine which are the observations with the common and uninteresting values that should be under-sampled. Namely, we propose the strategy of under-sampling observations whose target value has a relevance less than a user-defined parameter. This threshold defines the set of observations that are relevant according to the user preference bias,

    D_r = {⟨x, y⟩ ∈ D : φ(y) ≥ t}    (10)

where t is the user-defined threshold on relevance. Under-sampling will be carried out on the remaining observations D_i = D \ D_r.

Regarding the amount of under-sampling to be carried out, the strategy is the following. For each of the relevant observations in D_r we will randomly select n_u cases from the "normal" observations in D_i. The value of n_u is another user-defined parameter that establishes the desired ratio between "normal" and relevant observations. Too large values of n_u will result in a new training data set that is still too unbalanced, but too small values may result in a training set that is too small, particularly if there are too few relevant observations.

3.2 SMOTE for regression

Smote (Chawla et al., 2002) is a sampling method to address classification problems with imbalanced class distributions. The key feature of this method is that it combines under-sampling of the frequent classes with over-sampling of the minority class. Chawla et al. (2002) show the advantages of this approach when compared to other alternative sampling techniques on several real world problems using several classification algorithms. The key contribution of our work is to propose a variant of Smote for addressing regression tasks where the key goal is to accurately predict rare extreme values, which we will name SmoteR.

The original Smote algorithm uses an over-sampling strategy that consists of generating "synthetic" cases with a rare target value. Chawla et al. (2002) propose an interpolation strategy to create these artificial examples. For each case from the set of observations with rare values (D_r), the strategy is to randomly select one of its k-nearest neighbours from this same set. With these two observations a new example is created whose attribute values are an interpolation of the values of the two original cases. Regarding the target variable, as Smote is applied to classification problems with a single class of interest, all cases in D_r belong to this class and the same will happen to the new synthetic cases.

There are three key components of the Smote algorithm that we need to address in order to adapt it to our target regression tasks: i) how to define which are the relevant observations and the "normal" cases; ii) how to create new synthetic examples (i.e. over-sampling); and iii) how to decide the target variable value of these new synthetic examples. Regarding the first issue, the original algorithm is based on the information provided by the user concerning which class value is the target/rare class (usually known as the minority or positive class). In our problems we face a potentially infinite number of values of the target variable. As we have mentioned in Section 2.1, our proposal is based on the existence of a relevance function and on a user-specified threshold on the values of this function, which leads to the definition of the set D_r (cf. Equation 10). Our algorithm will over-sample the observations in D_r and under-sample the remaining cases (D_i), thus leading to a new training set with a more balanced distribution of the values.
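The relevance-based split of Equation (10), together with the random under-sampling of D_i used both in Section 3.1 and inside SmoteR, can be sketched in R as follows. The data frame D with target column Y, the relevance function phi(), the threshold t and the ratio nu are all assumed to be given; this is only an illustration of the idea, not our actual implementation.

    # Split the data by relevance (Equation (10)) and under-sample the "normal"
    # cases, keeping (at most) nu of them for each relevant case.
    rel  <- phi(D$Y)
    Dr   <- D[rel >= t, ]                              # relevant (rare) cases
    Di   <- D[rel <  t, ]                              # "normal" cases
    keep <- sample(nrow(Di), min(nrow(Di), round(nu * nrow(Dr))))
    trainSet <- rbind(Dr, Di[keep, ])                  # under-sampled training set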
Regarding the second key component, the generation of new cases, we use the same approach as in the original algorithm, though we have introduced some small modifications to be able to handle both numeric and nominal attributes. Finally, the third key issue is to decide the target variable value of the generated observations. In the original algorithm this is a trivial question, because as all rare cases have the same class (the target minority class), the same will happen to the examples generated from this set. In our case the answer is not so trivial. The cases that are to be over-sampled do not have the same target variable value, although they do have a high relevance score (φ(y)). This means that when a pair of examples is used to generate a new synthetic case, they will not have the same target variable value. Our proposal is to use a weighted average of the target variable values of the two seed examples. The weights are calculated as an inverse function of the distance of the generated case to each of the two seed examples.

Algorithm 1 The main SmoteR algorithm.

    function SmoteR(D, t_E, %o, %u, k)
        // D - a data set
        // t_E - the threshold for relevance of the target variable values
        // %o, %u - percentages of over- and under-sampling
        // k - the number of neighbours used in case generation
        rareL ← {⟨x, y⟩ ∈ D : φ(y) > t_E ∧ y < ỹ}     // ỹ is the median of the target Y
        newCasesL ← genSynthCases(rareL, %o, k)        // generate cases for rareL
        rareH ← {⟨x, y⟩ ∈ D : φ(y) > t_E ∧ y > ỹ}
        newCasesH ← genSynthCases(rareH, %o, k)        // generate cases for rareH
        newRareCases ← rareL ∪ newCasesL ∪ rareH ∪ newCasesH
        tgtNorm ← %u of |newRareCases|
        newNormCases ← sample of tgtNorm cases ∈ D \ (rareL ∪ rareH)
        return newRareCases ∪ newNormCases
    end function

Algorithm 1 describes our proposed SmoteR sampling method. The algorithm uses a user-defined threshold (t_E) of relevance to define the sets D_r and D_i. Notice that in our target applications we may have two rather different sets of rare cases: the extreme high and low values. This is another difference to the original algorithm. The consequence of this is that the generation of the synthetic examples is also done separately for these two sets. The reason is that although both sets include rare and interesting cases, they are of different types and thus have very different target variable values (extremely high and low values). The other parameters of the algorithm are the percentages of over- and under-sampling, and the number of neighbours to use in the case generation. The key aspect of this algorithm is the generation of the synthetic cases (the two calls to function genSynthCases). This process is described in detail in Algorithm 2. The main differences to the original Smote algorithm are: the ability to handle both numeric and nominal variables; and the way the target value for the new cases is generated. Regarding the former issue, we simply perform a random selection between the values of the two seed cases. A possible alternative could be to use some biased sampling that considers the frequency of occurrence of each of the values within the rare cases. Regarding the target value, we have used a weighted average between the values of the two seed cases. The weights are decided based on the distance between the new case and these two seed cases: the larger the distance, the smaller the weight.

Algorithm 2 Generating synthetic cases.

    function genSynthCases(D, %o, k)
        newCases ← {}
        ng ← %o / 100                                  // nr. of new cases to generate for each existing case
        for all case ∈ D do
            nns ← kNN(k, case, D \ {case})             // k-nearest neighbours of case
            for i ← 1 to ng do
                x ← randomly choose one of the nns
                for all a ∈ attributes do              // generate attribute values
                    if isNumeric(a) then
                        diff ← case[a] − x[a]
                        new[a] ← case[a] + random(0, 1) × diff
                    else
                        new[a] ← randomly select among case[a] and x[a]
                    end if
                end for
                d1 ← dist(new, case)                   // decide the target value
                d2 ← dist(new, x)
                new[Target] ← (d2 × case[Target] + d1 × x[Target]) / (d1 + d2)
                newCases ← newCases ∪ {new}
            end for
        end for
        return newCases
    end function
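The interpolation step of Algorithm 2 can be sketched in R for a single pair of seed cases as follows. The helper and its argument names are ours and purely illustrative: it interpolates numeric attributes, randomly picks one of the two values for nominal attributes (assumed here to be stored as character), and sets the target value as the distance-weighted average used in Algorithm 2; the search for the k-nearest neighbours and the loop generating %o/100 cases per seed are omitted.

    # Illustrative generation of one synthetic case from two seed cases given as
    # 1-row data frames with the same columns; 'tgt' is the name of the target.
    genOneCase <- function(seed, neigh, tgt) {
      new <- seed
      for (a in setdiff(names(seed), tgt)) {
        if (is.numeric(seed[[a]])) {
          new[[a]] <- seed[[a]] + runif(1) * (neigh[[a]] - seed[[a]])   # interpolate
        } else {
          new[[a]] <- sample(c(seed[[a]], neigh[[a]]), 1)               # random pick
        }
      }
      num <- setdiff(names(seed)[sapply(seed, is.numeric)], tgt)        # numeric attributes
      d1 <- sqrt(sum((unlist(new[num]) - unlist(seed[num]))^2))         # dist(new, seed)
      d2 <- sqrt(sum((unlist(new[num]) - unlist(neigh[num]))^2))        # dist(new, neigh)
      new[[tgt]] <- (d2 * seed[[tgt]] + d1 * neigh[[tgt]]) / (d1 + d2)  # weighted target
      new
    }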
4 Experimental Evaluation

The goal of our experiments is to test the effectiveness of our proposed sampling approaches at predicting rare extreme values of a continuous target variable. For this purpose we have selected 17 regression data sets whose main characteristics are described in Table 1. For each of these data sets we have obtained a relevance function using the automatic method proposed by Ribeiro (2011). This method assigns higher relevance to values above (below) the adjacent values of the target variable sample distribution. These are calculated as a function of the quartiles and the inter-quartile range and are well-known thresholds for considering a value an outlier. The result of applying this method are relevance functions that assign higher relevance to high and low rare extreme values, which are the target of the work in this paper. Based on these relevance functions we have imposed a threshold of 0.7 on the values of φ(Y) as the condition for a value to be taken as a rare extreme. As can be seen from the information in Table 1, this leads to an average of 10-15% of the available cases having a rare extreme value for most data sets considered in our experiments.

In order to avoid any algorithm-dependent bias distorting our experimental results, we have carried out our comparisons using a diverse set of standard regression algorithms. Moreover, for each algorithm we have considered several parameter variants. Table 2 summarizes the learning algorithms that were used and also the respective parameter variants. For instance, for the MARS regression algorithm we have considered the 8 variants resulting from all combinations of parameter values reported in Table 2. To ensure easy replication of our work we have used the implementations of these algorithms available in the free and open source R environment (R Core Team, 2013), which is also the infrastructure used to implement our proposed sampling methods.

Table 1: Used data sets and characteristics (N: nr. of cases; p: nr. of predictors; nRare: nr. of cases with φ(Y) > 0.70; %Rare: nRare/N).

    Data Set    N     p   nRare  %Rare
    a1          198   12  32     0.162
    a2          198   12  26     0.131
    a3          198   12  34     0.172
    a4          198   12  37     0.187
    a5          198   12  22     0.111
    a6          198   12  35     0.177
    a7          198   12  29     0.146
    Abalone     4177  9   882    0.211
    Accel       1732  15  106    0.061
    dAiler      7129  6   617    0.087
    availPwr    1802  16  197    0.109
    bank8FM     4499  9   384    0.085
    cpuSm       8192  13  804    0.098
    dElev       9517  7   1109   0.117
    fuelCons    1764  38  225    0.128
    boston      506   14  74     0.146
    maxTorque   1802  33  163    0.09

Table 2: Regression algorithms and parameter variants, and the respective R packages.

    Learner         Parameter Variants                                      R package
    MARS            nk = {10, 17}, degree = {1, 2}, thresh = {0.01, 0.001}  earth (Milborrow, 2012)
    SVM             cost = {10, 150, 300}, gamma = {0.01, 0.001}            e1071 (Dimitriadou et al., 2011)
    Random Forest   mtry = {5, 7}, ntree = {500, 750, 1500}                 randomForest (Liaw and Wiener, 2002)
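As a small illustration of how such a grid of variants can be enumerated in R, the 8 MARS variants are simply all combinations of the parameter values listed in Table 2 (the argument names follow the earth package):

    # Enumerate the 8 MARS parameter variants of Table 2.
    mars.grid <- expand.grid(nk = c(10, 17), degree = c(1, 2),
                             thresh = c(0.01, 0.001))
    nrow(mars.grid)   # 8 variants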
Each of the 20 learning approaches (8 MARS variants + 6 SVM variants + 6 Random Forest variants) was applied to each of the 17 regression problems using 13 different sampling approaches, namely: i) carrying out no sampling at all (i.e. using the data set with the original imbalance); ii) 8 variants of our SmoteR method; and iii) 4 variants of under-sampling. The 8 SmoteR variants used 5 nearest neighbours for case generation, a relevance threshold of 0.70 and all combinations of {200, 500}% and {50, 100, 200, 300}% for the percentages of over- and under-sampling, respectively (cf. Algorithm 1). For instance, in the data set availPwr, which contains 197 rare cases (cf. Table 1), when using 200% over-sampling and 50% under-sampling we would generate 2 new cases for each existing rare case, thus leading to 591 (= 197 + 197 × 2) rare cases and 295 normal cases (50% of 591), i.e. a training set of 886 cases, whilst the original data set contains 1802 cases. With respect to under-sampling, the 4 variants used {50, 100, 200, 300}% as percentages of under-sampling and the same 0.70 relevance threshold. This means that if applying under-sampling with 50% to the same availPwr data set, we would end up with a training set formed by 197 rare cases plus 98 (50% of 197) normal cases.

To have a better idea of the impact of these re-sampling strategies on the training set that is finally given to the learners, Figure 2 shows the distribution of the target variable [8] on the original data and on the data sets resulting from applying two of the most successful variants of our re-sampling strategies, for 2 specific data sets. The graphs in this figure clearly illustrate the change in the target variable distribution that is carried out by these re-sampling strategies, with the goal of biasing this distribution towards the areas where the relevance function has higher values.

Figure 2: Distribution of the target variable before and after re-sampling (density of Y for the None, U0.5 and S.o2.u0.5 training sets, with φ(Y) overlaid, on the Acceleration and Available Power data sets).

[8] Approximated through a kernel density estimator.

Moreover, as illustrated above, the methods also change the number of cases used for training, which will obviously have an impact on the computation time taken to obtain the models. More specifically, for the data sets in Figure 2, the original Acceleration data set contains 1732 observations and the S.o2.u0.5 configuration of SmoteR leads to a training set of 477 cases, whilst the U0.5 under-sampling variant uses only 159 cases. With respect to the Available Power data set, the original size is 1802 and the two re-sampling variants use 886 and 295 cases, respectively. Given that the SmoteR variant uses a more complex algorithm to obtain the training set, involving distance calculations to find the nearest neighbours of rare cases, and leads to a larger training set, we can immediately see that this re-sampling approach requires more computation time than the simpler under-sampling strategy. Still, under these two particular configurations the training set will always be smaller than the original data set.
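A comparison of the kind shown in Figure 2 can be obtained directly from kernel density estimates of the target variable. A minimal sketch, assuming (hypothetical) data frames orig, und and smo holding the original, U0.5 and S.o2.u0.5 training sets with the target in column Y:

    # Target variable distribution before and after re-sampling (cf. Figure 2).
    plot(density(orig$Y), main = "Target distribution", xlab = "Y")
    lines(density(und$Y), lty = 2)    # under-sampling (U0.5)
    lines(density(smo$Y), lty = 3)    # SmoteR (S.o2.u0.5)
    legend("topright", legend = c("None", "U0.5", "S.o2.u0.5"), lty = 1:3)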
The goal of our experiments is to compare the 12 (8 SmoteR + 4 under-sampling) re-sampling approaches against the default of learning with the given data, using 20 learning approaches on 17 data sets. All alternatives were evaluated according to the F-measure with β = 1 (cf. Equation 9), which means that the same importance was given to both the precision and recall scores, which were calculated using the set-up described in Section 2.2. The values of the F-measure were estimated by means of 3 repetitions of a 10-fold cross validation process, and the statistical significance of the observed paired differences was measured using the non-parametric Wilcoxon paired test.

Given the amount of experimental variants it is not possible to show all results within the space restrictions of a paper. Figure 3 shows the results for 3 of our 17 data sets. The full results for all data sets can be seen in the web page associated with the paper. For each combination of data set and regression algorithm the graph provides 13 box-plots, one for each of the 12 mentioned sampling approaches plus the alternative of using the original data (tagged as none in the graphs). The box-plots show the distribution of the F1 scores of all variants of each learner on each data set. All scores are estimated by means of a 3 × 10-fold cross validation process. These 3 particular data sets were chosen because they represent three different patterns of results. Results on data set a7 are among the best from the re-sampling approaches' perspective. The boston data set can be regarded as an example of a domain where the advantage of re-sampling approaches is not so marked, while data set bank8FM is an example where re-sampling approaches behave poorly, with the exception of the random forest variants.

Figure 3: Behaviour of the sampling strategies on 3 illustrative data sets (S - SmoteR; U - under-sampling; ox - x×100% over-sampling; ux - x×100% under-sampling).

When taking into consideration all 17 data sets, in most cases we have an advantage of the re-sampling approaches. This can be confirmed when looking at the overall results in terms of the statistical significance of the paired differences between each sampling approach and the alternative of using the original data (the baseline). Table 3 summarizes the results of these paired comparisons. Each sampling strategy was compared against the baseline 340 times (20 learning variants times 17 data sets). For each paired comparison we check the statistical significance of the difference between the average F score obtained with the respective sampling approach and with the baseline. These averages were estimated using a 3 × 10-fold CV process. We counted the number of statistically significant wins and losses of each of the 12 sampling variants on these 340 paired comparisons using two significance levels (99% and 95%).

Table 3: Summary of the paired comparisons to the no sampling baseline.

    Sampling Strat.  Win (99%)  Win (95%)  Loss (99%)  Loss (95%)  Insignif. Diff.
    U0.5             243        31         17          5           44
    U1               213        41         10          4           72
    U2               184        36         7           2           111
    U3               138        42         11          3           146
    S.o2.u0.5        229        31         14          6           60
    S.o5.u0.5        216        42         15          3           64
    S.o2.u1          201        52         12          3           72
    S.o5.u1          199        44         14          3           80
    S.o2.u2          173        43         11          2           111
    S.o5.u2          172        46         13          1           108
    S.o2.u3          127        69         17          2           125
    S.o5.u3          142        46         15          6           131
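Each of the 340 win/loss decisions summarized in Table 3 comes from a paired significance test of the kind mentioned above. A minimal sketch, assuming (hypothetical) vectors f.sampled and f.baseline with the 30 fold-level F1 estimates (3 repetitions of 10-fold CV) of one sampling variant and of the no-sampling baseline for a given data set and learning variant:

    # Paired Wilcoxon test on the fold-level F1 estimates of the two alternatives.
    wilcox.test(f.sampled, f.baseline, paired = TRUE)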
The results of Table 3 provide clear evidence of the advantage of using re-sampling approaches when the task is to predict rare extreme values of a continuous target variable. In effect, we can observe an overwhelming advantage in terms of the number of statistically significant wins over the alternative of using the data set as given (i.e. no re-sampling). For instance, the particular configuration of using under-sampling with 50% (U0.5) was significantly better than the alternative of using the given data set in 80.5% ((243 + 31)/340) of the 340 considered situations, while only in 6.4% of the cases did under-sampling actually lead to a significantly worse model. The remarkable performance of this very simple re-sampling strategy is further reinforced by the fact that it uses a much smaller training set than the other alternatives, which means lower computation costs. The SmoteR variant with 200% over-sampling and 50% under-sampling (S.o2.u0.5) also achieved very good results (76.5% significant wins). A pattern of results we have observed in our experiments is that higher values of the percentage of under-sampling lead to worse results. The lower this percentage, the higher the imbalance in favour of the rare cases (for instance, with 50% under-sampling there will be twice as many rare cases as normal cases). This has improved performance in all our experimental set-ups. Another conclusion from our experiments is that increasing the percentage of over-sampling (i.e. generating more synthetic cases) makes the results worse. This may be an indication that the process being used for generating new cases may be taking too many risks, leading to a degradation of the accuracy of the models.

In summary, the results of our experimental comparisons provide clear evidence of the validity of the re-sampling approaches we have considered, with particularly good results being obtained with training sets that bias the distribution towards a higher frequency of rare cases (i.e. that apply strong under-sampling of the normal cases).

5 Conclusions

This paper has presented a general approach to tackle the problem of forecasting rare extreme values of a continuous target variable using standard regression tools. The key advantage of the described re-sampling approaches is their simplicity. They allow the use of standard out-of-the-box regression tools on these particular prediction tasks by simply manipulating the distribution of the available training data.

The key contributions of this paper are: i) showing that re-sampling approaches can be successfully applied to this type of regression tasks; and ii) adapting two of the most successful sampling methods (Smote and under-sampling) to regression tasks.

The large set of experiments we have carried out on a diverse set of problems and using rather different learning algorithms highlights the advantages of our proposals when compared to the alternative of simply applying the algorithms to the available data sets. Particularly noticeable are the results obtained with strong under-sampling of the normal cases. Moreover, in the case of under-sampling with 50% the good predictive performance is accompanied by lower computation costs.

R code implementing both the SmoteR and under-sampling strategies described in Section 3 is freely provided at http://www.dcc.fc.up.pt/~ltorgo/ExpertSystems. This URL also includes all code and data sets necessary to replicate the experiments in the paper, as well as extra experimental results that we could not fit within the paper.

Acknowledgements

This work is part-funded by the ERDF - European Regional Development Fund through the COMPETE Programme (operational programme for competitiveness) and by the Portuguese Funds through the FCT (Portuguese Foundation for Science and Technology) within project FCOMP-01-0124-FEDER-022701.

References
Chawla, N. V. (2005). Data mining for imbalanced datasets: An overview. In The Data Mining and Knowledge Discovery Handbook. Springer.

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). Smote: Synthetic minority over-sampling technique. JAIR, 16:321-357.

Cleveland, W. (1993). Visualizing Data. Hobart Press.

Davis, J. and Goadrich, M. (2006). The relationship between precision-recall and ROC curves. In ICML'06: Proc. of the 23rd Int. Conf. on Machine Learning, ACM ICPS, pages 233-240. ACM.

Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D., and Weingessel, A. (2011). e1071: Misc Functions of the Department of Statistics (e1071), TU Wien.

Domingos, P. (1999). Metacost: A general method for making classifiers cost-sensitive. In KDD'99: Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining, pages 155-164. ACM Press.

Elkan, C. (2001). The foundations of cost-sensitive learning. In IJCAI'01: Proc. of the 17th Int. Joint Conf. on Artificial Intelligence, volume 1, pages 973-978. Morgan Kaufmann Publishers.

Kubat, M. and Matwin, S. (1997). Addressing the curse of imbalanced training sets: One-sided selection. In Proc. of the 14th Int. Conf. on Machine Learning, pages 179-186. Morgan Kaufmann.

Liaw, A. and Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3):18-22.

Ling, C. and Sheng, V. (2010). Cost-sensitive learning and the class imbalance problem. In Encyclopedia of Machine Learning. Springer.

Milborrow, S. (2012). earth: Multivariate Adaptive Regression Spline Models. Derived from mda:mars by Trevor Hastie and Rob Tibshirani.

R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Ribeiro, R. P. (2011). Utility-based Regression. PhD thesis, Dep. Computer Science, Faculty of Sciences - University of Porto.

Torgo, L. and Ribeiro, R. P. (2007). Utility-based regression. In PKDD'07: Proc. of the 11th European Conf. on Principles and Practice of Knowledge Discovery in Databases, pages 597-604. Springer.

Torgo, L. and Ribeiro, R. P. (2009). Precision and recall in regression. In DS'09: 12th Int. Conf. on Discovery Science, pages 332-346. Springer.

Zadrozny, B. (2003). Policy mining: Learning decision policies from fixed sets of data. PhD thesis, University of California, San Diego.

Zadrozny, B. (2005). One-benefit learning: cost-sensitive learning with restricted cost information. In UBDM'05: Proc. of the 1st Int. Workshop on Utility-Based Data Mining, pages 53-58. ACM Press.