Bootstrapping with Noise: An Effective Regularization Technique

YUVAL RAVIV & NATHAN INTRATOR
School of Mathematical Sciences, Sackler Faculty of Exact Sciences, Tel-Aviv University, Ramat Aviv 69978, Israel. E-mail: {yuv,nin}@math.tau.ac.il
(Present address of N. Intrator: Institute of Brain and Neural Systems, Box 1843, Brown University, Providence, RI 02912, USA.)

Connection Science, Vol. 8, Nos 3 & 4, 1996, 355-372. http://dx.doi.org/10.1080/095400996116811

Abstract. Bootstrap samples with noise are shown to be an effective smoothness and capacity control technique for training feedforward networks and for other statistical methods such as generalized additive models. It is shown that noisy bootstrap performs best in conjunction with weight-decay regularization and ensemble averaging. The two-spiral problem, a highly non-linear, noise-free data set, is used to demonstrate these findings. The combination of noisy bootstrap and ensemble averaging is also shown to be useful for generalized additive modelling, and is demonstrated on the well-known Cleveland heart data.

KEYWORDS: Noise injection, combining estimators, pattern classification, two-spiral problem, clinical data analysis.

1. Introduction

The bootstrap technique has become one of the major tools for producing empirical confidence intervals of estimated parameters or predictors (Efron & Tibshirani, 1993). One way to view the bootstrap is as a method to simulate the noise inherent in the data, and thus effectively increase the number of training patterns. A simple bootstrap procedure amounts to sampling with replacement from the training data and constructing several training sets, all with the same size as the original training set.
Later, the variability between the estimated parameters can be measured, giving some indication of the true variability of the model parameters arising from the data. Furthermore, the variability of the prediction, or error bars on the prediction, can also be estimated in this way. One variant of the bootstrap involves estimation of a model of the form $y = f(x) + \epsilon$ for some parametric family to which $f$ belongs and a noise term $\epsilon$ which is assumed to be small with zero mean. Once an empirical function $\hat{f}$ has been estimated from $n$ training samples, there remains a noise vector $\epsilon = (\epsilon_1, \ldots, \epsilon_n)$. One can then sample from the empirical distribution of the noise by sampling (with replacement) from the $\epsilon_i$ and constructing new samples of the form $(x_i^*, y_i^*)$, in which $\epsilon_i$ is replaced by $\epsilon_i^*$ drawn from the above set. Clearly, this approach can easily be extended to a smoothed bootstrap (Efron & Tibshirani, 1993) by sampling from a smoothed version of the empirical distribution of the $\epsilon_i$ rather than just sampling from the set of $\epsilon_i$'s. In that case, one can increase the size of each bootstrap set, since, due to the noise, the different sets are sufficiently independent. It should be noted that if $\hat{f}$ is biased, the noise vector may be over-estimated.

For classification problems, the form $y = f(x + \epsilon)$ may be more appropriate. In this case, using noise injection to the inputs during training can improve the generalization properties of the estimator (Sietsma & Dow, 1991). Recently, Bishop (1995) has shown that training with small amounts of noise is locally equivalent to smoothness regularization. In this paper, we give a different interpretation to noise added to the input during training, and view it as a regularizing parameter that controls, in conjunction with ensemble averaging, the capacity and the smoothness of the estimator. The major role of this noise is to push different estimators to different local minima, and so produce a more independent set of estimators. Best performance is then achieved by averaging over the estimators. For this regularization, the level of the noise may be larger than the 'true' level, which can be indirectly estimated.

Since we want to study the effect of bootstrapping with noise on the smoothness of the estimator, separately from the task of input noise estimation, we consider a highly non-linear, noise-free classification problem, and show that even in this extreme case, addition of noise during training improves results significantly. We chose a problem that is very difficult for feedforward neural networks (NNs). It is difficult due to the highly non-linear nature of the decision boundaries, and the fact that these non-linearities are easier to represent with local, radially symmetric functions than with ridge functions such as those computed by feedforward sigmoidal units.
Since the training data are given with no noise, it may seem unreasonable to train a network with noise, but we show that even in this case training with noise is a very effective approach for smoothing the estimator. In addition to demonstrating our method on a different class of predictors, the generalized additive models, we also apply it to another well-known data set, the Cleveland heart data (Detrano et al., 1989).

2. Theoretical Considerations

There are a number of factors that have to be applied carefully when trying to regularize an estimator. The regularization is aimed at finding an optimal trade-off between the variance and bias of the estimator (Geman et al., 1992), and for best performance one has to exploit this decomposition of the error function. The motivation for our approach follows from a key observation regarding the bias/variance decomposition, namely the fact that ensemble averaging does not affect the bias portion of the error, but reduces the variance when the estimators over which averaging is done are independent.

2.1. Bias/Variance Trade-off for an Ensemble of Predictors

The classification problem is to estimate a function $f_X(x)$ of observed data characteristics $x$, predicting class label $y$, based on a given training set $X = \{(x_1, y_1), \ldots, (x_L, y_L)\}$, using some measure of the estimation error on $X$. A good estimator will perform well not only on the training set, but also on new validation sets which were not used during estimation. Evaluation of the performance of the estimator is commonly done via the mean squared error (MSE) by taking the expectation with respect to the (unknown) probability distribution $P$ of $y$:

$$E[(y - f_X(x))^2 \mid x, X]$$

This can be decomposed into

$$E[(y - f_X(x))^2 \mid x, X] = E[(y - E[y \mid x])^2 \mid x, X] + E[(f_X(x) - E[y \mid x])^2]$$

The first term does not depend on the training data $X$ or on the estimator $f_X(x)$; it measures the amount of noise, or variability, of $y$ given $x$. Hence, $f$ can be evaluated using

$$E[(f_X(x) - E[y \mid x])^2]$$

The empirical MSE of $f$ is given by

$$E_X[(f_X(x) - E[y \mid x])^2]$$

where $E_X$ represents expectation with respect to all possible training sets $X$ of fixed size. To examine further the performance under MSE, we decompose the error into bias and variance components to get

$$E_X[(f_X(x) - E[y \mid x])^2] = (E_X[f_X(x)] - E[y \mid x])^2 + E_X[(f_X(x) - E_X[f_X(x)])^2] \qquad (1)$$

The first term on the right-hand side is called the bias of the estimator and the second term is called the variance. When training on a fixed training set $X$, reducing the bias with respect to this set may increase the variance of the estimator and contribute to poor generalization performance. This is known as the trade-off between variance and bias. Typically, variance is reduced by smoothing; however, this may introduce bias (since, for example, it may blur sharp peaks). Bias is reduced by prior knowledge. When prior knowledge is also used for smoothing, it is likely to reduce the overall MSE of the estimator. When training NNs, the variance arises from two terms. The first term comes from inherent data randomness and the second comes from the non-identifiability of the model, namely, the fact that for a given training data set there may be several (local) minima of the error surface.[1]
Consider the ensemble average $\bar{f}$ of several predictors, e.g. NNs with different random initial weights which are trained on data with added Gaussian noise:

$$\bar{f}(x) = \frac{1}{N} \sum_{i=1}^{N} f_i(x)$$

These predictors are identically distributed and, thus, the variance contribution (equation (1)) becomes (we omit $x$ and $X$ for clarity)

$$E[(\bar{f} - E[\bar{f}])^2] = E\Big[\Big(\tfrac{1}{N}\textstyle\sum_i f_i - E\big[\tfrac{1}{N}\sum_i f_i\big]\Big)^2\Big] = E\Big[\Big(\tfrac{1}{N}\textstyle\sum_i f_i\Big)^2\Big] + \Big(E\big[\tfrac{1}{N}\textstyle\sum_i f_i\big]\Big)^2 - 2E\Big[\tfrac{1}{N}\textstyle\sum_i f_i \, E\big[\tfrac{1}{N}\sum_i f_i\big]\Big]$$

$$= E\Big[\Big(\tfrac{1}{N}\textstyle\sum_i f_i\Big)^2\Big] - \Big(E\big[\tfrac{1}{N}\textstyle\sum_i f_i\big]\Big)^2 \qquad (2)$$

The first term on the right-hand side can be rewritten as

$$E\Big[\Big(\tfrac{1}{N}\textstyle\sum_i f_i\Big)^2\Big] = \frac{1}{N^2}\sum_i E[f_i^2] + \frac{2}{N^2}\sum_{i<j} E[f_i f_j]$$

and the second term gives

$$\Big(E\big[\tfrac{1}{N}\textstyle\sum_i f_i\big]\Big)^2 = \frac{1}{N^2}\sum_i (E[f_i])^2 + \frac{2}{N^2}\sum_{i<j} E[f_i]E[f_j]$$

Plugging these equalities into equation (2) gives

$$E[(\bar{f} - E[\bar{f}])^2] = \frac{1}{N^2}\sum_i \big\{E[f_i^2] - (E[f_i])^2\big\} + \frac{2}{N^2}\sum_{i<j} \big\{E[f_i f_j] - E[f_i]E[f_j]\big\} \qquad (3)$$

If the predictors $\{f_i\}$ are highly correlated, for example if $f_i = f_j = f$ for all $i, j$, then the above equation becomes

$$\mathrm{Var}(\bar{f}) = \frac{1}{N}\mathrm{Var}(f) + \frac{2}{N^2}\,\frac{N(N-1)}{2}\,\mathrm{Var}(f) = \mathrm{Var}(f)$$

namely, there is no reduction in variance[2] in this case. If the predictors are identically distributed and independent, then the second term drops and we are left with

$$\mathrm{Var}(\bar{f}) = \frac{1}{N}\mathrm{Var}(f_i)$$

Note that

$$E[f_i f_j] - E[f_i]E[f_j] = E\big(\{f_i - E[f_i]\}\{f_j - E[f_j]\}\big)$$

Thus, the notion of independence can be understood as independence of the deviations of each predictor from its expected value, which can be replaced (due to linearity) by

$$E\big(\{f_i - E[\bar{f}]\}\{f_j - E[\bar{f}]\}\big)$$

and is thus interpreted as independence of the prediction variation around a common mean.

The success of ensemble averaging of NNs in the past (Breiman, 1994; Hansen & Salamon, 1990; Perrone, 1993; Wolpert, 1992) is due to the fact that NNs have, in general, many local minima, and thus, even with the same training set, different local minima are found when starting from different random initial conditions. These different local minima lead to somewhat independent predictors, and thus the averaging can reduce the variance. When a larger set of independent networks is needed, but no more data are available, data reuse methods can help. Bootstrapping (Breiman, 1994) has been very helpful, since by resampling (with replacement) from the training data, the independence of the training sets is increased, and hence the independence of the estimators, leading to improved ensemble results. Smoothed bootstrap (Krogh & Hertz, 1992; Ripley, 1996) is potentially more useful since larger sets of independent training samples can be generated. The smoothed bootstrap approach amounts to generating larger data sets by simulating the true noise in the data.
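To make the two limiting cases of equation (3) concrete, the following short simulation (ours, not part of the original study) compares the variance of the ensemble average when the $N$ predictors are independent with the case in which they are identical; the predictor outputs are stand-in Gaussian variables rather than trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 20000   # Monte Carlo repetitions
N = 10             # ensemble size
sigma2 = 1.0       # variance of each individual predictor f_i

# Independent predictors: the f_i are i.i.d. with variance sigma2.
f_indep = rng.normal(0.0, np.sqrt(sigma2), size=(n_trials, N))
var_indep = f_indep.mean(axis=1).var()

# Fully correlated predictors: f_i = f for all i.
f_common = rng.normal(0.0, np.sqrt(sigma2), size=(n_trials, 1))
var_corr = np.repeat(f_common, N, axis=1).mean(axis=1).var()

print(f"Var(f_bar), independent predictors: {var_indep:.3f} (theory: {sigma2 / N:.3f})")
print(f"Var(f_bar), identical predictors:   {var_corr:.3f} (theory: {sigma2:.3f})")
```

The simulation reproduces the two extremes derived above: averaging independent predictors divides the variance by $N$, while averaging identical predictors leaves it unchanged.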
3. The Bootstrap Ensemble with Noise Algorithm

In the bootstrap ensemble with noise (BEN), we push the idea of noise injection further; we observe that adding noise to the inputs increases the first term on the right-hand side of equation (3), i.e. adds variance to each estimator, but, on the other hand, decreases the contribution of the second term, as it increases the independence between estimators. Instead of using the 'true' noise (estimated from the data) for the bootstrap, we seek an optimal noise level which gives the smallest contribution to the error from the sum of the two components of the variance. It is impossible to calculate the optimal variance of the Gaussian noise without knowing $f$ explicitly; therefore, the value of this variance remains a regularization term: a parameter which has to be estimated so as to minimize the total contribution of the variance to the error. Furthermore, since the injection of noise increases the independence between different training sets, we can use bootstrap sets that are larger than the original training set. This does not affect the bias (if the noise is symmetric around zero) but can reduce the variance. Note that the bias contribution to the error is not affected by introducing the ensemble-average estimator, due to linearity of expectations. It follows that the BEN approach has the potential to reduce the contribution of the variance term to the total error. We should thus seek a different trade-off point between the contributions of the variance and the bias. In other words, we are able to use large (unbiased) networks without being affected by the large variance associated with such networks. This observation implies that the estimation of optimal noise levels should not be based on the performance of a single estimator, but rather on the ensemble performance. The large variance of each single network in the ensemble can be tempered with a regularization such as weight decay (Krogh & Hertz, 1992; Ripley, 1996), but, again, the estimation of the optimal regularization factor should be done on the ensemble-averaged performance. Breiman (1994) and Ripley (1996) show compelling empirical evidence for the importance of weight decay as a single-network stabilizer. Our results confirm this fact under the BEN model.

The BEN Algorithm

- Let $\{(x_i, y_i)\}$ be a set of training patterns for $i = 1, \ldots, N$.
- Let $\epsilon = \{\epsilon_1, \ldots, \epsilon_J\}$ be a set of candidate noise levels.
- Let $\lambda = \{\lambda_1, \ldots, \lambda_I\}$ be a set of candidate weight-decay penalties.
- For a noise level $\epsilon_j$, estimate an optimal penalty term for weight decay $\lambda_i$:
  - Fix a size $K$ for the bootstrap sample, such that $K \gg N$ (we used $K = 10N$).
  - Let $s_1, s_2, \ldots, s_K$ be a set of indices chosen from a uniform distribution, $s_i \sim U(1, N)$.
  - For the given $\epsilon_j$, create a noisy bootstrap resample of the training set inputs, $\{x_{s_i} + z_i\}_{i=1,\ldots,K}$, and the corresponding resampled outputs $\{y_{s_i}\}_{i=1,\ldots,K}$, where $z_i$ is a vector whose components are $N(0, \epsilon_j^2)$.
  - Train several networks on the noisy samples using weight decays $\lambda_1, \ldots, \lambda_I$.
  - Generate an ensemble average of the set of networks.
  - Choose, via cross-validation or a test set, the optimal weight decay $\lambda$.
- Repeat the process for a new choice of noise $\epsilon_j$ until there is no improvement in prediction.
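A minimal sketch of the resampling step of the algorithm above, assuming NumPy arrays; the function name is ours, and the paper's own experiments were run with Ripley's S-Plus NNET package rather than Python.

```python
import numpy as np

def noisy_bootstrap_resample(X, y, noise_sd, K=None, rng=None):
    """Create a noisy bootstrap resample of size K (the paper uses K = 10N).

    Inputs are drawn with replacement from the original N patterns and
    perturbed with zero-mean Gaussian noise of standard deviation noise_sd;
    the corresponding targets are copied unchanged.
    """
    rng = np.random.default_rng() if rng is None else rng
    N = len(X)
    K = 10 * N if K is None else K
    idx = rng.integers(0, N, size=K)                      # s_1, ..., s_K ~ U(1, N)
    Z = rng.normal(0.0, noise_sd, size=(K, X.shape[1]))   # z_i with components N(0, eps_j^2)
    return X[idx] + Z, y[idx]
```

For each candidate noise level $\epsilon_j$, such a resample would be fed to several networks trained with the weight-decay values $\lambda_1, \ldots, \lambda_I$, and the $(\epsilon, \lambda)$ pair would be selected on the ensemble-averaged performance, as prescribed above.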
In the simple case, the same noise level is used for each dimension. This is suitable for problems in which each of the dimensions is on the same scale or, more precisely, when the noise distribution is similar in different data dimensions. When all covariates have the same interpretation, e.g. similar measurements taken at different time steps, or when dealing with pixel data, such a noise assumption is adequate; however, when the noise is non-homogeneous in space, has a non-diagonal covariance matrix, or when different dimensions represent completely different measurements, it is best to estimate the noise level in each dimension separately. When this is too costly, or there are insufficient data for robust estimation, a quick solution is to sphere the data by setting the variance in each dimension to be the same, with zero mean.

Figure 1. The two-spirals training data (left). Training points with noise of standard deviation SD = 0.3 (right). As can be seen, the noise level that contaminates the data causes points to cross the virtual boundary defined by the data, i.e. the noise leads to wrong class labelling for the training data. This reduces the performance of single predictors, but the added independence between the predictors leads to improved ensemble performance.

3.1. The Two-spirals Problem

The 'two-spirals' problem consists of a training set with 194 X-Y values, half of which are to produce a 1 output and half a 0 output. These training points are arranged in two interlocking spirals that go around the origin three times, as shown in Figure 1. The problem was proposed for the CMU benchmark by Alexis Wieland of MITRE Corporation (see Appendix A for a description of the problem). It appears to be extremely hard for backpropagation networks due to its high non-linearity. It is easy to see that the two-dimensional points of the spirals cannot be separated by a small combination of linear separators. Lang and Witbrock (1988) proposed a 2-5-5-5-1 network with short-cuts, using 138 weights. They used a variant of the quick-prop learning algorithm (Fahlman, 1989) with weight decay. They claimed that the problem could not be solved with a simpler architecture (i.e. fewer layers or no short-cuts). Their result on the same data set seems to give poor generalization. Baum and Lang (1991) demonstrated that there are many sets of weights that would make a 2-50-1 network consistent with the training set; however, the single-hidden-layer feedforward architecture trained with error backpropagation was unable to find any of them when starting from random initial weights. Fahlman and Lebiere (1990) used the cascade-correlation architecture for this problem. They obtained better results, but still little 'spiralness'. Recently, Deffuant (1995) suggested the 'perceptron membrane' method, which uses piecewise linear surfaces as discriminators, and applied it to the spiral problem. He used 29 perceptrons but had difficulties capturing the structure of the spirals due to the piecewise linearity of his decision boundaries.

The two-spirals problem was chosen for this study because it is a hard problem for backpropagation networks due to its high non-linearity, it is a noise-free problem, and the generalization performance of different predictors can be easily visualized on the two-dimensional plane. In Section 5, we demonstrate our method on another well-known machine-learning problem, the prediction of coronary artery disease based on the Cleveland heart data, which resides in the University of California at Irvine (UCI) machine-learning repository (Murphy & Aha, 1992).
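For reference, the two-spirals training set can be generated directly from the equations given in Appendix A. The sketch below (ours) parametrizes the number of points per spiral; 97 points per class gives the 194 training points quoted above.

```python
import numpy as np

def two_spirals(n_per_class=97):
    """Generate the two-spirals data following Appendix A:
    x_i = r_i * cos(alpha_i + k*pi/2),  y_i = r_i * sin(alpha_i + k*pi/2),
    with alpha_i = pi*i/16, r_i = 0.5 + i/16, and k = 1 or 3 for the two classes.
    """
    i = np.arange(n_per_class)
    alpha = np.pi * i / 16.0
    r = 0.5 + i / 16.0
    points, labels = [], []
    for k, label in ((1, 1), (3, 0)):
        x = r * np.cos(alpha + k * np.pi / 2)
        y = r * np.sin(alpha + k * np.pi / 2)
        points.append(np.column_stack([x, y]))
        labels.append(np.full(n_per_class, label))
    return np.vstack(points), np.concatenate(labels)
```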
4. Results on the Spiral Data

4.1. Feedforward Network Architecture

We used Ripley's (1996) S-Plus NNET package, which implements backpropagation. The minimization criterion is the MSE with weight-decay regularization of the form

$$E = \sum_p |t_p - y_p|^2 + \lambda \sum_{i,j} w_{i,j}^2$$

where $t_p$ is the target and $y_p$ the output for the $p$th example pattern, the $w_{i,j}$ are the weights and $\lambda$ is a parameter that controls the amount of weight-decay regularization. The network architecture was 2-30-1 (two inputs, 30 hidden units and one output). The first and last layers were fully connected to the hidden layer, giving a total of 121 weights. The transfer function of the hidden and output units was the logistic sigmoid. The initial weights were drawn from $U(-0.7, 0.7)$. It should be noted that although we train 5-40 networks, the effective number of parameters is not more (and probably even less) than the number of parameters of a single network. This is because we do not have the flexibility to estimate an optimal combination of predictors, but rather take their simple average.

Baseline results were obtained by training 40 networks without any regularization. We then derived an average predictor whose output is the mean of the 40 nets' outputs (Figure 2 (top left)). The predictor had no smoothness constraints and therefore found relatively linear boundaries (this can also be seen in Figure 3 (top left), where a five-net ensemble average is taken).

Figure 2. Summary of 40-net ensemble results. Top left: no constraints (no weight decay or noise). Top right: optimal weight decay (λ = 3e-4) and no noise. Bottom left: optimal noise (Gaussian SD = 0.35) and zero weight decay. Bottom right: optimal noise and optimal weight decay. The classification threshold in this figure and the following ones is 0.5.

4.1.1. Effect of training with noise on a flexible predictor. We trained 30-hidden-unit networks using the bootstrap method (as described in Section 3), with noise SD ranging from ε = 0 to ε = 0.8, and K = 10N. Figure 3 demonstrates the effect of noise on the predictor. Each image is a thresholded output of a five-net ensemble-average predictor. The noise level goes from ε = 0 in the upper left image through ε = 0.8 in the lower right. The classification results are drawn on a uniform grid of 100 × 100 points (namely, a much larger test set) so as to get a clear view of the classification boundaries defined by the classifier. It can be seen that for small noise levels ε, the ensemble-average predictor is unable to find any smooth structure in the data and merely over-fits the training data. For moderate levels of noise, a better structure can be found, and for large levels of noise, the data are so corrupted that again no structure can be found. The optimal noise SD was around ε = 0.35.
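The experiment above can be approximated with off-the-shelf tools. The sketch below uses scikit-learn's MLPClassifier, whose alpha parameter is an L2 penalty on the weights playing the role of the weight-decay term λ; the optimizer, stopping rule and exact penalty scaling differ from Ripley's NNET package, so the numbers will not match the paper exactly.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def ben_ensemble_predict(X, y, grid, noise_sd=0.35, weight_decay=3e-4,
                         n_nets=5, K=None, seed=0):
    """Train n_nets 2-30-1 style networks on independent noisy bootstrap
    resamples and return the averaged class-1 probability on `grid`."""
    rng = np.random.default_rng(seed)
    N = len(X)
    K = 10 * N if K is None else K
    probs = np.zeros(len(grid))
    for _ in range(n_nets):
        idx = rng.integers(0, N, size=K)
        Xb = X[idx] + rng.normal(0.0, noise_sd, size=(K, X.shape[1]))
        net = MLPClassifier(hidden_layer_sizes=(30,), activation="logistic",
                            alpha=weight_decay, max_iter=2000,
                            random_state=int(rng.integers(1 << 31)))
        net.fit(Xb, y[idx])
        probs += net.predict_proba(grid)[:, 1]
    return probs / n_nets

# Example use: evaluate the ensemble on a 100 x 100 grid and threshold at 0.5.
# X, y = two_spirals()   # from the earlier sketch
# xx, yy = np.meshgrid(np.linspace(-7, 7, 100), np.linspace(-7, 7, 100))
# grid = np.column_stack([xx.ravel(), yy.ravel()])
# labels = (ben_ensemble_predict(X, y, grid) > 0.5).reshape(xx.shape)
```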
4.1.2. Effect of weight-decay regularization. Weight-decay regularization involves finding an optimal parameter λ that controls the amount of weight decay versus the bias of the net. We trained networks with different λ's and found that optimal values were around λ = 3e-4. When comparing the effect of averaging alone with the effect of regularization via weight decay with no averaging, it turns out that the bootstrap method (averaged over different initial network weights) has better generalization properties than the weight-decay method. The weight-decay regularization does not generalize well on the outer points, where the training data are more sparse.

4.1.3. Applying bootstrap to networks with weight decay. Our best results were obtained when applying the BEN method to networks with optimal weight-decay regularization. Figure 4 demonstrates the effect of bootstrap with noise on the performance of a five-net ensemble trained with optimal weight decay. The effect of ensemble averaging over networks that were trained only with different random initial conditions is shown in the top left image, which represents no noise during training. Optimal noise values are similar to those obtained when training with no weight decay, and are surprisingly high (see Figure 1 (right) for the corruption the noise causes in the data). Although the results look better than those with no weight decay, in the sense that the boundaries look smoother, they can still be improved by averaging over a larger ensemble of networks. This is demonstrated in the next section (Figure 2).

Figure 3. Effect of training with different levels of Gaussian noise. Ensembles of five networks with no weight decay and a varying degree of noise (top left is zero noise, bottom right is noise with SD = 0.8).

The effect of averaging is summarized in Figure 2. It can be seen that the 40-net ensemble-averaging results, with no weight decay and no noise, are better than the corresponding ones when an ensemble of five nets is used (Figure 3). Similarly, the results for an ensemble of 40 networks trained with optimal weight decay and no noise are better than the corresponding five-net ensemble (Figure 4 (top left)). Finally, the combination of weight decay, noise and a 40-net ensemble clearly gives the best results (Figure 2 (bottom right)). Thus, while earlier work suggested that a single-hidden-layer feedforward network is not capable of capturing the structure in the spiral data, it is evident that a network ensemble with strong control over its capacity (via weight decay), trained with heavy noise, can discover the highly non-linear structure of the problem.

Figure 4. Effect of training with different noise levels on five-net ensemble networks with weight decay. Noise levels are as before, 0-0.8 from top left to bottom right.

4.2. Generalized Additive Models

In this section, we take a different approach. Instead of analyzing a method that has a hard time with the spiral data, we study a model that is very natural for it. We apply bootstrapping to a generalized additive model (GAM) (Hastie & Tibshirani, 1986, 1990) with a polynomial fit of degree 1 on the same data. We had to optimize the degree of the polynomial and the span, which determines the smoothness and the degree of locality of the estimation.[3] Due to these efficient controls, this flexible model is much more appropriate for the spiral data.
Furthermore, this algorithm provides a unique model, i.e. for each set of parameters there is no variability in the produced models, as opposed to the variability generated by the random initial weights of a feedforward network. All of this suggests that there should be no reason to bootstrap with noise, since the smoothness and locality controls can already regulate the smoothness of the boundary surface, and there seems to be no reason to corrupt the data with unfamiliar noise. Moreover, there is no need to average over several models, since there is no variability due to different local minima of the resulting model. It is thus surprising that even in this extreme case, bootstrapping with noise improved the generalization results. Figure 5 depicts the results for various degrees of noise added during training. It is clear that the bootstrap improves results and, furthermore, that small values of the noise sharpen the result.

Figure 5. Model estimation using GAM with bootstrap. Ten GAM predictors are averaged using bootstrap samples with a varying degree of noise. There is no noise (and thus no averaging) in the top left result.

5. Cleveland Heart Data

In this section, we analyze the Cleveland heart data (Detrano et al., 1989), donated by Dr Robert Detrano[4] to the UCI machine-learning repository (Murphy & Aha, 1992). These data concern the diagnosis of coronary artery disease and have been used in the past by statisticians and by the machine-learning community (Brazdil & Henery, 1994; Detrano et al., 1989; Gennari et al., 1988; Stensmo, 1995). Further data and pre-processing details are given in Appendix B. The pre-processing, which included removal of missing values, sphering the data and creating dummy variables to replace categorical variables, resulted in a dramatic improvement over past results. Moreover, it revealed that in the new data representation the structure is very linear, since logistic regression was able to obtain a nine-fold cross-validation error of about 15.2%. A similar error was obtained using extensive pre-processing and temporal-difference reinforcement learning (Stensmo, 1995). Both results are consistent with our feedforward architecture results with no noise injection and are (as far as we know) the current best results on these data.[5] It is thus a very challenging problem for NNs, as the deviation from linear structure is very small,[6] and highly non-linear estimators such as CART, radial-basis functions and KNN did not do so well on these data (Brazdil & Henery, 1994). The problem is complementary to the spiral problem considered before; there, we attempted to improve performance on highly non-linear data which required a large-capacity network, while here we try to improve performance on a relatively linear problem using a small-capacity network. In both cases, we show that noise cannot be replaced by network size or weight-decay regularization and is essential for good performance.
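As a point of reference, the nine-fold cross-validated logistic-regression baseline can be sketched as follows; it assumes the pre-processed 19-variable representation described in Appendix B (a sketch of that pre-processing appears after the appendix) and uses scikit-learn rather than the original S-Plus code.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def cleveland_glm_cv_error(X, y, n_folds=9, seed=0):
    """Nine-fold cross-validated classification error (per cent) of logistic
    regression on the pre-processed Cleveland data (19 columns, 0/1 target)."""
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    glm = LogisticRegression(max_iter=1000)
    accuracy = cross_val_score(glm, X, y, cv=cv, scoring="accuracy")
    return 100.0 * (1.0 - accuracy.mean())
```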
Figure 6. Results from logistic regression and from feedforward networks with three hidden units and varying degrees of weight decay. Left: per cent classification error. Right: ROC values. All results were obtained with nine-fold cross-validation on the Cleveland heart data. In both graphs, the first boxplot from the left represents the generalized linear model results.

Figure 6 summarizes a model comparison, using nine-fold cross-validation,[7] between logistic regression and three-hidden-unit networks based on Ripley's NNET package described in Section 4.1. Training was stopped after 400 epochs or earlier, based on Ripley's conditions. The network results were obtained by training five networks on each of the nine cross-validation sets and averaging their results. Thus, each classification error is generated from 45 networks. In each of the following figures, the statistics were obtained from 12-20 similar runs differing in random initial conditions and in the choice of cross-validation sets from the data. The cross-validation code is based on the public-domain version of Tibshirani in Statlib.[8] The results are summarized by boxplots[9] (Hoaglin et al., 1983). Each boxplot is based on 500-900 single network runs. As the ratio between the two classes is different from one, classification error is not a very robust measure for model comparison, since it is based on a single classification threshold. For example, if one class represents only 10% of the data, then setting the threshold to 1 will result in a trivial classifier that produces zero regardless of the input and has only 10% error. The receiver-operating characteristic (ROC) (Goodenough et al., 1974; Hanley & McNeil, 1982) is frequently used in such model comparisons, especially for clinical data (Henderson, 1993). This measure has been used by the contributor of the data (Detrano, 1989) and in assessing neural network performance on other heart disease data (Lippmann et al., 1995). Figure 6 implies that the performance of NNs (without noise injection), as measured by error rate and ROC values, is slightly worse (not statistically significantly) than logistic regression, and cannot be improved by weight-decay regularization alone.

Figure 7 shows the effect of noise injection for various levels of weight decay for an over-capacity architecture of nine hidden units. Noise levels in all the following graphs represent the SD of the zero-mean Gaussian noise. Although noise injection produces significant improvement, the absolute values are sub-optimal since the architecture is too large. Note, however, that the ROC values for the 0.75 weight-decay net are the highest compared with logistic regression (GLM ROC = 0.903 ± 0.001, NNET ROC = 0.91 ± 0.002; t = 1.766, degrees of freedom (df) = 21, P < 0.045; Z = 1.691, P < 0.045) or with the optimal three-hidden-unit network. We have used both the t-statistic (Hogg & Craig, 1970) and the Z-statistic of the Wilcoxon test (Lehmann, 1975), which uses non-parametric ranks to test the difference in the medians and is more robust to outliers.

Figure 7. Integrated bias, variance and total error for the hepatoma data set.

Figure 8. The effect of noise injection is diminished when no weight decay is used (compare with Figure 9). An optimal architecture of three hidden units cannot produce good results without weight decay. Left: classification error. Right: ROC values.
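For completeness, a sketch of the two comparison tools used in this section: the area under the ROC curve for each model and paired t/Wilcoxon tests over matched cross-validation runs, using scikit-learn and SciPy. SciPy's Wilcoxon implementation reports the signed-rank statistic rather than the normal-approximation Z value quoted in the text.

```python
from scipy import stats
from sklearn.metrics import roc_auc_score

def compare_models(y_true, prob_glm, prob_net, err_glm_runs, err_net_runs):
    """Area under the ROC curve for each model's predicted probabilities,
    plus paired tests on matched per-run classification errors."""
    auc_glm = roc_auc_score(y_true, prob_glm)
    auc_net = roc_auc_score(y_true, prob_net)
    # Paired t-test on matched cross-validation runs.
    t_stat, p_t = stats.ttest_rel(err_glm_runs, err_net_runs)
    # Wilcoxon signed-rank test (non-parametric, more robust to outliers).
    w_stat, p_w = stats.wilcoxon(err_glm_runs, err_net_runs)
    return {"auc_glm": auc_glm, "auc_net": auc_net,
            "t": t_stat, "p_t": p_t, "wilcoxon": w_stat, "p_wilcoxon": p_w}
```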
The ROC results suggest that the classification error of this model could be improved, possibly by averaging over a larger number of networks. To see the performance of noise injection alone, we present results of noise injection with zero weight decay and the optimal architecture (Figure 8), and show that even under a low-capacity architecture, weight decay is essential to stabilize the system.

Optimal results are presented in Figure 9. With optimal weight decay and architecture, the addition of noise achieves results which are better than any other network, and better than logistic regression. The mean error of logistic regression was 15.27 ± 0.18, the mean error for the zero-noise net was 15.07 ± 0.13 and the mean error for noise with SD = 0.3 was 14.56 ± 0.22. The difference between the optimal neural network and logistic regression is statistically significant (t = 2.196, df = 26, P < 0.018; Z = 2.14, P < 0.016), and the difference from zero noise is significant as well (t = 2.045, df = 27, P < 0.025; Z = 2.029, P < 0.021). To our knowledge, these are the best results on the Cleveland heart data.

Figure 9. Results for the optimal-architecture network. Left: classification error. Right: ROC values. Noise injection is helpful and overall performance is optimal.

6. Discussion

The motivation for our approach comes from a key observation regarding the bias/variance decomposition of prediction error, namely the fact that ensemble averaging does not affect the bias portion of the error, but reduces the variance when the estimators over which averaging is done are independent. The level of noise affects the independence between the training sets, and thus the relative improvement from ensemble averaging. However, the level of noise also affects the quality of each predictor separately, increasing its variance by increasing the variability in the data. Thus, there should be an optimal level of noise (it may not correspond to the true noise) which leads to optimal ensemble performance. This performance can be further improved if the variance of the individual networks can be tempered, e.g. with weight decay.

We have demonstrated the effect of noise injection on prediction in three different cases. (i) Highly non-linear (spiral) data, using an inappropriate model (as the data are almost radially symmetric and the neural net is not). This required the use of an ensemble of high-capacity single predictors and thus made the regularization task challenging. It was shown that the excess variance of high-capacity models could only be effectively trimmed by a combination of all three components: weight decay, noise injection and ensemble averaging. (ii) Highly non-linear (spiral) data with essentially the perfect model for it (GAM with locally linear units). Even in this case, where regularization provides the perfect bias for the model, performance could be improved by the combination. (iii) A highly linear problem, where practically any network has excess capacity.
This case is representative of a family of clinical data sets in which (linear) variable selection was applied to high-dimensional data and resulted in a highly linear, low-dimensional data structure. It was thus challenging to show that the BEN algorithm is useful in this case too, and can lead to improved classification results. Performance was also evaluated with the ROC measure, as it is a standard model-comparison tool in clinical data analysis.

The theoretical analysis suggests that it is best to start with a very flexible function approximation technique (e.g. a feedforward network with a large number of hidden units) and then control its capacity and smoothness using noise and averaging. Our conclusions are not restricted to artificial neural network estimation. We show that similar conclusions can be obtained when using a highly flexible GAM (Hastie & Tibshirani, 1986).

Acknowledgements

Stimulating discussions with Leo Breiman, Brian Ripley and Chris Bishop are gratefully acknowledged.

Notes

1. An example of an identifiable model is (logistic) regression.
2. Where $\mathrm{Var}(f)$ is defined by $E_X[(f_X(x) - E_X[f_X(x)])^2]$.
3. In this case, the model amounts to a sum of locally linear functions around each of the training samples.
4. VA Medical Center, Long Beach and Cleveland Clinic Foundation.
5. A recent best result of 23.1% on non-normalized data was obtained by a company that provides classification with its own proprietary software (UDM, 1996).
6. This is a classical problem in clinical data in which variable selection was done by a linear method; the data therefore contain mostly variables with linear structure.
7. This is a standard use; see, for example, results under the STATLOG ESPRIT project (Brazdil & Henery, 1994).
8. http://www.stat.cmu.edu.
9. To read the boxplots: the white line in the middle of the box represents the median of the distribution; the grey box represents the inter-quartile range, such that the bottom of the box is the first quartile and the top is the third quartile; the dashed line and its terminating line represent plus and minus 1.5 inter-quartile distances from the median; points lying outside this range are considered outliers, and each such point is represented by a whisker.
10. Can be obtained from Murphy and Aha (1992).

References

Baum, E. & Lang, K. (1991) Constructing hidden units using examples and queries. In R. P. Lippmann, J. E. Moody & D. S. Touretzky (Eds), Advances in Neural Information Processing Systems, Vol. 3. San Mateo, CA: Morgan Kaufmann, pp. 904-910.
Bishop, C.M. (1995) Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7, 108-116.
Brazdil, P. & Henery, R. (1994) Analysis of results. In D. Michie, D. J. Spiegelhalter & C. C. Taylor (Eds), Machine Learning, Neural and Statistical Classification. New York: Ellis Horwood, pp. 175-212.
Breiman, L. (1994) Bagging Predictors. Technical Report TR-421, Department of Statistics, University of California, Berkeley, CA.
Deffuant, G. (1995) An algorithm for building regularized, piecewise linear discrimination surfaces: the perceptron membrane. Neural Computation, 7, 480-489.
Detrano, R. (1989) Accuracy curves: an alternative graphical representation of probability data. Journal of Clinical Epidemiology, 42, 983-986.
Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J., Sandhu, S., Guppy, K., Lee, S. & Froelicher, V. (1989) International application of a new probability algorithm for the diagnosis of coronary artery disease. American Journal of Cardiology, 64, 304-310.
Efron, B. & Tibshirani, R. (1993) An Introduction to the Bootstrap. New York: Chapman and Hall.
Fahlman, S.E. (1989) Fast-learning variations on back-propagation: an empirical study. In D. Touretzky, G. Hinton & T. Sejnowski (Eds), Proceedings of the 1988 Connectionist Models Summer School. San Mateo, CA: Morgan Kaufmann, pp. 38-51.
Fahlman, S.E. & Lebiere, C. (1990) The Cascade-Correlation Learning Architecture. Technical Report CMU-CS-90-100, Carnegie Mellon University, Pittsburgh, PA.
Geman, S., Bienenstock, E. & Doursat, R. (1992) Neural networks and the bias/variance dilemma. Neural Computation, 4, 1-58.
Gennari, J.H., Langley, P. & Fisher, D. (1988) Models of incremental concept formation. Artificial Intelligence, 40, 11-61.
Goodenough, D.J., Rossmann, K. & Lusted, L.B. (1974) Radiographic applications of receiver operating characteristic (ROC) curves. Radiology, 110, 89-95.
Hanley, J.A. & McNeil, B.J. (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29-36.
Hansen, L.K. & Salamon, P. (1990) Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12, 993-1001.
Hastie, T. & Tibshirani, R. (1986) Generalized additive models. Statistical Science, 1, 297-318.
Hastie, T. & Tibshirani, R. (1990) Generalized Additive Models. London: Chapman and Hall.
Henderson, A.R. (1993) Assessing test accuracy and its clinical consequences: a primer for receiver operating characteristic curve analysis. Annals of Clinical Biochemistry, 30, 521-539.
Hoaglin, D.C., Mosteller, F. & Tukey, J.W. (1983) Understanding Robust and Exploratory Data Analysis. New York: Wiley.
Hogg, R.V. & Craig, A.T. (1970) Introduction to Mathematical Statistics (3rd edn). Toronto, Canada: Macmillan.
Krogh, A. & Hertz, J.A. (1992) A simple weight decay can improve generalization. In J. E. Moody, S. J. Hanson & R. P. Lippmann (Eds), Advances in Neural Information Processing Systems, Vol. 4. San Mateo, CA: Morgan Kaufmann, pp. 950-957.
Lang, K.J. & Witbrock, M.J. (1988) Learning to tell two spirals apart. In D. S. Touretzky, J. L. Elman, T. J. Sejnowski & G. E. Hinton (Eds), Proceedings of the 1988 Connectionist Models Summer School, pp. 52-59.
Lehmann, E.L. (1975) Nonparametrics: Statistical Methods Based on Ranks. San Francisco, CA: Holden-Day.
Lippmann, R.P., Kukolich, L. & Shahian, D. (1995) Predicting the risk of complications in coronary artery bypass operations using neural networks. In G. Tesauro, D. Touretzky & T. Leen (Eds), Advances in Neural Information Processing Systems, Vol. 7. Cambridge, MA: MIT Press, pp. 1055-1062.
Murphy, P.M. & Aha, D.W. (1992) UCI Repository of Machine Learning Databases. Department of Information and Computer Science, University of California at Irvine.
Perrone, M.P. (1993) Improving Regression Estimation: Averaging Methods for Variance Reduction with Extensions to General Convex Measure Optimization. PhD thesis, Brown University, Institute for Brain and Neural Systems, Providence, RI.
Ripley, B.D. (1996) Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press.
Sietsma, J. & Dow, R.J.F. (1991) Creating artificial neural networks that generalize. Neural Networks, 4, 67-79.
Stensmo, M. (1995) Adaptive Automated Diagnosis. PhD thesis, Royal Institute of Technology, Stockholm, Sweden.
Ultragem Data Mining (UDM) (1996) Esprit Statlog Benchmarks. Technical Report, Boulder Creek, CA.
Wolpert, D.H. (1992) Stacked generalization. Neural Networks, 5, 241-259.

Appendix A: The Spiral Data

The two-dimensional spiral data[10] (Lang & Witbrock, 1988) are given by vectors $(x_i, y_i)$ defined by

$$x_i = r_i \cos(\alpha_i + k\pi/2), \qquad y_i = r_i \sin(\alpha_i + k\pi/2) \qquad (A1)$$

where

$$\alpha_i = \pi i/16, \qquad r_i = 0.5 + i/16, \qquad i = 0, \ldots, 97 \qquad (A2)$$

and $k = 1$ for one class and $k = 3$ for the other class.

Appendix B: Details and Pre-processing of the Cleveland Heart Data

The data in the UCI repository contain 13 variables out of about 70 that were in the original study. The task is to predict the existence of coronary artery disease (CAD) from the measurements. Data for 303 patients were obtained; 44% of the patients were diagnosed with CAD. The variable attributes are:

(1) Age
(2) Sex
(3) Chest pain type (4 values, converted to 3 binary variables)
(4) Resting blood pressure
(5) Serum cholesterol in mg/dl
(6) Fasting blood sugar > 120 mg/dl
(7) Resting electrocardiographic results (values 0, 1, 2)
(8) Maximum heart rate achieved
(9) Exercise-induced angina
(10) Oldpeak = ST depression induced by exercise relative to rest
(11) The slope of the peak exercise ST segment (converted to 2 binary variables)
(12) Number of major vessels (0-3) coloured by fluoroscopy (converted to 3 binary variables)
(13) Thal: 3 = normal; 6 = fixed defect; 7 = reversible defect (converted to 2 binary variables)

We added dummy variables to replace the categorical and ordinal variables 3, 11, 12 and 13, and therefore worked with 19 independent variables. The continuous variables 1, 4, 5, 8 and 10 were sphered (standardized) by setting the mean of each variable to zero and its variance to one. This step was necessary as the data contain variables that are on very different scales, such as age and blood pressure. The original data contain 76 attributes and have many missing values. The data used in most of the benchmarks have only 13 attributes and a few missing values, which we simply replaced by their unconditional expectations. The addition of dummy variables and data sphering had a dramatic effect on the classification results.
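A sketch of the Appendix B pre-processing in pandas. The column names follow the common UCI documentation of the 13 attributes and are an assumption on our part, as is the handling of missing categorical codes (which can change the final column count slightly).

```python
import pandas as pd

# Assumed column names; the numbers in comments refer to the attribute list above.
CATEGORICAL = ["cp", "slope", "ca", "thal"]                      # variables 3, 11, 12, 13
CONTINUOUS = ["age", "trestbps", "chol", "thalach", "oldpeak"]   # variables 1, 4, 5, 8, 10

def preprocess_cleveland(df):
    """Impute missing values by the column mean (unconditional expectation),
    dummy-code the categorical/ordinal variables and sphere the continuous
    ones to zero mean and unit variance. Target: CAD present (diagnosis > 0)."""
    df = df.copy()
    y = (df.pop("num") > 0).astype(int)          # 'num' is the diagnosis column (assumed name)
    df = df.apply(pd.to_numeric, errors="coerce")
    df = df.fillna(df.mean())
    df = pd.get_dummies(df, columns=CATEGORICAL, drop_first=True)
    df[CONTINUOUS] = (df[CONTINUOUS] - df[CONTINUOUS].mean()) / df[CONTINUOUS].std()
    return df, y
```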