Bootstrapping with Noise: An Effective Regularization Technique

YUVAL RAVIV & NATHAN INTRATOR
School of Mathematical Sciences, Sackler Faculty of Exact Sciences, Tel-Aviv University, Ramat Aviv 69978, Israel. E-mail: {yuv,nin}@math.tau.ac.il
(Present address of N. Intrator: Institute of Brain and Neural Systems, Box 1843, Brown University, Providence, RI 02912, USA.)

Connection Science, Vol. 8, Nos 3 & 4, 1996, 355-372. http://dx.doi.org/10.1080/095400996116811

Abstract. Bootstrap samples with noise are shown to be an effective smoothness and capacity control technique for training feedforward networks and for other statistical methods such as generalized additive models. It is shown that noisy bootstrap performs best in conjunction with weight-decay regularization and ensemble averaging. The two-spiral problem, a highly non-linear, noise-free data set, is used to demonstrate these findings. The combination of noisy bootstrap and ensemble averaging is also shown to be useful for generalized additive modelling, and is demonstrated on the well-known Cleveland heart data.

KEYWORDS: Noise injection, combining estimators, pattern classification, two-spiral problem, clinical data analysis.

1. Introduction

The bootstrap technique has become one of the major tools for producing empirical confidence intervals of estimated parameters or predictors (Efron & Tibshirani, 1993). One way to view the bootstrap is as a method to simulate the noise inherent in the data, and thus effectively increase the number of training patterns. A simple bootstrap procedure amounts to sampling with replacement from the training data and constructing several training sets, all with the same size as the original training set.
Later, the variability between the estimated parameters can be measured, giving some indication of the true variability of the model parameters arising from the data. Furthermore, the variability of the prediction, or error bars on the prediction, can also be estimated in this way. One variant of the bootstrap involves estimation of a model of the form $y = f(x) + \epsilon$ for some parametric family to which $f$ belongs and a noise term $\epsilon$ which is assumed to be small with zero mean. Once an empirical function $\hat{f}$ has been estimated from $n$ training samples, there remains a noise vector $\epsilon = (\epsilon_1, \ldots, \epsilon_n)$. One can then sample from the empirical distribution of the noise by sampling (with replacement) from the $\epsilon_i$ and constructing new samples of the form $(x_i^*, y_i^*)$, in which $\epsilon_i$ is replaced by $\epsilon_i^*$ drawn from the above set. Clearly, this approach can easily be extended to a smoothed bootstrap (Efron & Tibshirani, 1993) by sampling from a smoothed version of the empirical distribution of the $\epsilon_i$ rather than just sampling from the set of $\epsilon_i$'s. In that case, one can increase the size of each bootstrap set, since, due to the noise, the different sets are sufficiently independent. It should be noted that if $\hat{f}$ is biased, the noise vector may be over-estimated.

For classification problems, the form $y = f(x + \epsilon)$ may be more appropriate. In this case, using noise injection to the inputs during training can improve the generalization properties of the estimator (Sietsma & Dow, 1991). Recently, Bishop (1995) has shown that training with small amounts of noise is locally equivalent to smoothness regularization. In this paper, we give a different interpretation to noise added to the input during training, and view it as a regularizing parameter that controls, in conjunction with ensemble averaging, the capacity and the smoothness of the estimator. The major role of this noise is to push different estimators to different local minima, and so produce a more independent set of estimators. Best performance is then achieved by averaging over the estimators. For this regularization, the level of the noise may be larger than the 'true' level, which can be indirectly estimated.

Since we want to study the effect of bootstrapping with noise on the smoothness of the estimator, separately from the task of input noise estimation, we consider a highly non-linear, noise-free classification problem, and show that even in this extreme case, addition of noise during training improves results significantly. We chose a problem that is very difficult for feedforward neural networks (NNs). It is difficult due to the highly non-linear nature of the decision boundaries, and the fact that these non-linearities are easier to represent with local, radially symmetric functions than with ridge functions such as those computed by feedforward sigmoidal units.
Since the training data are given with no noise, it may seem unreasonable to train a network with noise, but we show that even in this case training with noise is a very effective approach for smoothing the estimator. In addition to demonstrating our method on a different class of predictors, the generalized additive models, we also apply it to another well-known data set, the Cleveland heart data (Detrano et al., 1989).

2. Theoretical Considerations

There are a number of factors that have to be applied carefully when trying to regularize an estimator. The regularization is aimed at finding an optimal trade-off between the variance and bias of the estimator (Geman et al., 1992), and for best performance one has to exploit this decomposition of the error function. The motivation for our approach follows from a key observation regarding the bias/variance decomposition, namely the fact that ensemble averaging does not affect the bias portion of the error, but reduces the variance when the estimators over which averaging is done are independent.

2.1. Bias/Variance Trade-off for an Ensemble of Predictors

The classification problem is to estimate a function $f_X(x)$ of observed data characteristics $x$, predicting class label $y$, based on a given training set $X = \{(x_1, y_1), \ldots, (x_L, y_L)\}$, using some measure of the estimation error on $X$. A good estimator will perform well not only on the training set, but also on new validation sets which were not used during estimation. Evaluation of the performance of the estimator is commonly done via the mean squared error (MSE) by taking the expectation with respect to the (unknown) probability distribution $P$ of $y$:

$$E[(y - f_X(x))^2 \mid x, X]$$

This can be decomposed into

$$E[(y - f_X(x))^2 \mid x, X] = E[(y - E[y \mid x])^2 \mid x, X] + E[(f_X(x) - E[y \mid x])^2]$$

The first term does not depend on the training data $X$ or on the estimator $f_X(x)$; it measures the amount of noise, or variability, of $y$ given $x$. Hence, $f$ can be evaluated using

$$E[(f_X(x) - E[y \mid x])^2]$$

The empirical MSE of $f$ is given by

$$E_X[(f_X(x) - E[y \mid x])^2]$$

where $E_X$ represents expectation with respect to all possible training sets $X$ of fixed size. To examine further the performance under MSE, we decompose the error into bias and variance components to get

$$E_X[(f_X(x) - E[y \mid x])^2] = (E_X[f_X(x)] - E[y \mid x])^2 + E_X[(f_X(x) - E_X[f_X(x)])^2] \qquad (1)$$

The first term on the right-hand side is called the bias of the estimator and the second term is called the variance. When training on a fixed training set $X$, reducing the bias with respect to this set may increase the variance of the estimator and contribute to poor generalization performance. This is known as the trade-off between variance and bias. Typically, variance is reduced by smoothing; however, this may introduce bias (since, for example, it may blur sharp peaks). Bias is reduced by prior knowledge. When prior knowledge is also used for smoothing, it is likely to reduce the overall MSE of the estimator. When training NNs, the variance arises from two terms. The first term comes from inherent data randomness and the second comes from the non-identifiability of the model, namely, the fact that for a given training data set there may be several (local) minima of the error surface.[1]
Consider the ensemble average $\bar{f}$ of several predictors, e.g. NNs with different random initial weights which are trained on data with added Gaussian noise:

$$\bar{f}(x) = \frac{1}{N} \sum_{i=1}^{N} f_i(x)$$

These predictors are identically distributed and, thus, the variance contribution (equation (1)) becomes (we omit $x$ and $X$ for clarity)

$$E[(\bar{f} - E[\bar{f}])^2] = E\Big[\Big(\tfrac{1}{N}\textstyle\sum_i f_i - E\big[\tfrac{1}{N}\sum_i f_i\big]\Big)^2\Big] = E\Big[\Big(\tfrac{1}{N}\textstyle\sum_i f_i\Big)^2\Big] + \Big(E\big[\tfrac{1}{N}\textstyle\sum_i f_i\big]\Big)^2 - 2E\Big[\tfrac{1}{N}\textstyle\sum_i f_i \, E\big[\tfrac{1}{N}\sum_i f_i\big]\Big]$$

$$= E\Big[\Big(\tfrac{1}{N}\textstyle\sum_i f_i\Big)^2\Big] - \Big(E\big[\tfrac{1}{N}\textstyle\sum_i f_i\big]\Big)^2 \qquad (2)$$

The first term on the right-hand side can be rewritten as

$$E\Big[\Big(\tfrac{1}{N}\textstyle\sum_i f_i\Big)^2\Big] = \frac{1}{N^2}\sum_i E[f_i^2] + \frac{2}{N^2}\sum_{i<j} E[f_i f_j]$$

and the second term gives

$$\Big(E\big[\tfrac{1}{N}\textstyle\sum_i f_i\big]\Big)^2 = \frac{1}{N^2}\sum_i (E[f_i])^2 + \frac{2}{N^2}\sum_{i<j} E[f_i]E[f_j]$$

Plugging these equalities into equation (2) gives

$$E[(\bar{f} - E[\bar{f}])^2] = \frac{1}{N^2}\sum_i \big\{E[f_i^2] - (E[f_i])^2\big\} + \frac{2}{N^2}\sum_{i<j} \big\{E[f_i f_j] - E[f_i]E[f_j]\big\} \qquad (3)$$

If the predictors $\{f_i\}$ are highly correlated, for example if $f_i = f_j = f$ for all $i, j$, then the above equation becomes

$$\mathrm{Var}(\bar{f}) = \frac{1}{N}\mathrm{Var}(f) + \frac{2}{N^2}\,\frac{N(N-1)}{2}\,\mathrm{Var}(f) = \mathrm{Var}(f)$$

namely, there is no reduction in variance[2] in this case. If the predictors are identically distributed and independent, then the second term drops and we are left with

$$\mathrm{Var}(\bar{f}) = \frac{1}{N}\mathrm{Var}(f_i)$$

Note that

$$E[f_i f_j] - E[f_i]E[f_j] = E\big(\{f_i - E[f_i]\}\{f_j - E[f_j]\}\big)$$

Thus, the notion of independence can be understood as independence of the deviations of each predictor from its expected value, which can be replaced (due to linearity) by

$$E\big(\{f_i - E[\bar{f}]\}\{f_j - E[\bar{f}]\}\big)$$

and is thus interpreted as independence of the prediction variation around a common mean.

The success of ensemble averaging of NNs in the past (Breiman, 1994; Hansen & Salamon, 1990; Perrone, 1993; Wolpert, 1992) is due to the fact that NNs have, in general, many local minima, and thus, even with the same training set, different local minima are found when starting from different random initial conditions. These different local minima lead to somewhat independent predictors, and thus the averaging can reduce the variance. When a larger set of independent networks is needed, but no more data are available, data reuse methods can help. Bootstrapping (Breiman, 1994) has been very helpful, since by resampling (with replacement) from the training data, the independence of the training sets is increased, and hence the independence of the estimators, leading to improved ensemble results. Smoothed bootstrap (Krogh & Hertz, 1992; Ripley, 1996) is potentially more useful since larger sets of independent training samples can be generated. The smoothed bootstrap approach amounts to generating larger data sets by simulating the true noise in the data.
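To make the two limiting cases of equation (3) concrete, the following short simulation (ours, not part of the original study) compares the variance of the ensemble average when the $N$ predictors are independent with the case in which they are identical; the predictor outputs are stand-in Gaussian variables rather than trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 20000   # Monte Carlo repetitions
N = 10             # ensemble size
sigma2 = 1.0       # variance of each individual predictor f_i

# Independent predictors: the f_i are i.i.d. with variance sigma2.
f_indep = rng.normal(0.0, np.sqrt(sigma2), size=(n_trials, N))
var_indep = f_indep.mean(axis=1).var()

# Fully correlated predictors: f_i = f for all i.
f_common = rng.normal(0.0, np.sqrt(sigma2), size=(n_trials, 1))
var_corr = np.repeat(f_common, N, axis=1).mean(axis=1).var()

print(f"Var(f_bar), independent predictors: {var_indep:.3f} (theory: {sigma2 / N:.3f})")
print(f"Var(f_bar), identical predictors:   {var_corr:.3f} (theory: {sigma2:.3f})")
```

The simulation reproduces the two extremes derived above: averaging independent predictors divides the variance by $N$, while averaging identical predictors leaves it unchanged.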
3. The Bootstrap Ensemble with Noise Algorithm

In the bootstrap ensemble with noise (BEN), we push the idea of noise injection further; we observe that adding noise to the inputs increases the first term on the right-hand side of equation (3), i.e. adds variance to each estimator, but, on the other hand, decreases the contribution of the second term, as it increases the independence between estimators. Instead of using the 'true' noise (estimated from the data) for the bootstrap, we seek an optimal noise level which gives the smallest contribution to the error from the sum of the two components of the variance. It is impossible to calculate the optimal variance of the Gaussian noise without knowing $f$ explicitly; therefore, the value of this variance remains a regularization term: a parameter which has to be estimated so as to minimize the total contribution of the variance to the error. Furthermore, since the injection of noise increases the independence between different training sets, we can use bootstrap sets that are larger than the original training set. This does not affect the bias (if the noise is symmetric around zero) but can reduce the variance. Note that the bias contribution to the error is not affected by introducing the ensemble-average estimator, due to linearity of expectations. It follows that the BEN approach has the potential to reduce the contribution of the variance term to the total error. We should thus seek a different trade-off point between the contributions of the variance and the bias. In other words, we are able to use large (unbiased) networks without being affected by the large variance associated with such networks. This observation implies that the estimation of optimal noise levels should not be based on the performance of a single estimator, but rather on the ensemble performance. The large variance of each single network in the ensemble can be tempered with a regularization such as weight decay (Krogh & Hertz, 1992; Ripley, 1996), but, again, the estimation of the optimal regularization factor should be done on the ensemble-averaged performance. Breiman (1994) and Ripley (1996) show compelling empirical evidence for the importance of weight decay as a single-network stabilizer. Our results confirm this fact under the BEN model.

The BEN Algorithm

- Let $\{(x_i, y_i)\}$ be a set of training patterns for $i = 1, \ldots, N$.
- Let $\epsilon = \{\epsilon_1, \ldots, \epsilon_J\}$ be a set of candidate noise levels.
- Let $\lambda = \{\lambda_1, \ldots, \lambda_I\}$ be a set of candidate weight-decay penalties.
- For a noise level $\epsilon_j$, estimate an optimal penalty term for weight decay $\lambda_i$:
  - Fix a size $K$ for the bootstrap sample, such that $K \gg N$ (we used $K = 10N$).
  - Let $s_1, s_2, \ldots, s_K$ be a set of indices chosen from a uniform distribution, $s_i \sim U(1, N)$.
  - For the given $\epsilon_j$, create a noisy bootstrap resample of the training set inputs, $\{x_{s_i} + z_i\}_{i=1,\ldots,K}$, and the corresponding resampled outputs $\{y_{s_i}\}_{i=1,\ldots,K}$, where $z_i$ is a vector whose components are $N(0, \epsilon_j^2)$.
  - Train several networks on the noisy samples using weight decays $\lambda_1, \ldots, \lambda_I$.
  - Generate an ensemble average of the set of networks.
  - Choose, via cross-validation or a test set, the optimal weight decay $\lambda$.
- Repeat the process for a new choice of noise $\epsilon_j$ until there is no improvement in prediction.
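A minimal sketch of the resampling step of the algorithm above, assuming NumPy arrays; the function name is ours, and the paper's own experiments were run with Ripley's S-Plus NNET package rather than Python.

```python
import numpy as np

def noisy_bootstrap_resample(X, y, noise_sd, K=None, rng=None):
    """Create a noisy bootstrap resample of size K (the paper uses K = 10N).

    Inputs are drawn with replacement from the original N patterns and
    perturbed with zero-mean Gaussian noise of standard deviation noise_sd;
    the corresponding targets are copied unchanged.
    """
    rng = np.random.default_rng() if rng is None else rng
    N = len(X)
    K = 10 * N if K is None else K
    idx = rng.integers(0, N, size=K)                      # s_1, ..., s_K ~ U(1, N)
    Z = rng.normal(0.0, noise_sd, size=(K, X.shape[1]))   # z_i with components N(0, eps_j^2)
    return X[idx] + Z, y[idx]
```

For each candidate noise level $\epsilon_j$, such a resample would be fed to several networks trained with the weight-decay values $\lambda_1, \ldots, \lambda_I$, and the $(\epsilon, \lambda)$ pair would be selected on the ensemble-averaged performance, as prescribed above.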
In the simple case, the same noise level is used for each dimension. This is suitable for problems in which each of the dimensions is on the same scale or, more precisely, when the noise distribution is similar in different data dimensions. When all covariates have the same interpretation, e.g. similar measurements taken at different time steps, or when dealing with pixel data, such a noise assumption is adequate; however, when the noise is non-homogeneous in space, has a non-diagonal covariance matrix, or when different dimensions represent completely different measurements, it is best to estimate the noise level in each dimension separately. When this is too costly, or there are insufficient data for robust estimation, a quick solution is to sphere the data by setting the variance in each dimension to be the same, with zero mean.

Figure 1. The two-spirals training data (left). Training points with noise of standard deviation SD = 0.3 (right). As can be seen, the noise level that contaminates the data causes points to cross the virtual boundary defined by the data, i.e. the noise leads to wrong class labelling for the training data. This reduces the performance of single predictors, but the added independence between the predictors leads to improved ensemble performance.

3.1. The Two-spirals Problem

The 'two-spirals' problem consists of a training set with 194 X-Y values, half of which are to produce a 1 output and half a 0 output. These training points are arranged in two interlocking spirals that go around the origin three times, as shown in Figure 1. The problem was proposed for the CMU benchmark by Alexis Wieland of MITRE Corporation (see Appendix A for a description of the problem). It appears to be extremely hard for backpropagation networks due to its high non-linearity. It is easy to see that the two-dimensional points of the spirals cannot be separated by a small combination of linear separators. Lang and Witbrock (1988) proposed a 2-5-5-5-1 network with short-cuts, using 138 weights. They used a variant of the quick-prop learning algorithm (Fahlman, 1989) with weight decay. They claimed that the problem could not be solved with a simpler architecture (i.e. fewer layers or no short-cuts). Their result on the same data set seems to give poor generalization. Baum and Lang (1991) demonstrated that there are many sets of weights that would make a 2-50-1 network consistent with the training set; however, the single-hidden-layer feedforward architecture trained with error backpropagation was unable to find any of them when starting from random initial weights. Fahlman and Lebiere (1990) used the cascade-correlation architecture for this problem. They obtained better results, but still little 'spiralness'. Recently, Deffuant (1995) suggested the 'perceptron membrane' method, which uses piecewise linear surfaces as discriminators, and applied it to the spiral problem. He used 29 perceptrons but had difficulties capturing the structure of the spirals due to the piecewise linearity of his decision boundaries.

The two-spirals problem was chosen for this study because it is a hard problem for backpropagation networks due to its high non-linearity, it is a noise-free problem, and the generalization performance of different predictors can be easily visualized on the two-dimensional plane. In Section 5, we demonstrate our method on another well-known machine-learning problem, the prediction of coronary artery disease based on the Cleveland heart data, which resides in the University of California at Irvine (UCI) machine-learning repository (Murphy & Aha, 1992).
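For reference, the two-spirals training set can be generated directly from the equations given in Appendix A. The sketch below (ours) parametrizes the number of points per spiral; 97 points per class gives the 194 training points quoted above.

```python
import numpy as np

def two_spirals(n_per_class=97):
    """Generate the two-spirals data following Appendix A:
    x_i = r_i * cos(alpha_i + k*pi/2),  y_i = r_i * sin(alpha_i + k*pi/2),
    with alpha_i = pi*i/16, r_i = 0.5 + i/16, and k = 1 or 3 for the two classes.
    """
    i = np.arange(n_per_class)
    alpha = np.pi * i / 16.0
    r = 0.5 + i / 16.0
    points, labels = [], []
    for k, label in ((1, 1), (3, 0)):
        x = r * np.cos(alpha + k * np.pi / 2)
        y = r * np.sin(alpha + k * np.pi / 2)
        points.append(np.column_stack([x, y]))
        labels.append(np.full(n_per_class, label))
    return np.vstack(points), np.concatenate(labels)
```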
4. Results on the Spiral Data

4.1. Feedforward Network Architecture

We used Ripley's (1996) S-Plus NNET package, which implements backpropagation. The minimization criterion is the MSE with weight-decay regularization of the form

$$E = \sum_p |t_p - y_p|^2 + \lambda \sum_{i,j} w_{i,j}^2$$

where $t_p$ is the target and $y_p$ the output for the $p$th example pattern, the $w_{i,j}$ are the weights and $\lambda$ is a parameter that controls the amount of weight-decay regularization. The network architecture was 2-30-1 (two inputs, 30 hidden units and one output). The first and last layers were fully connected to the hidden layer, giving a total of 121 weights. The transfer function of the hidden and output units was the logistic sigmoid. The initial weights were drawn from $U(-0.7, 0.7)$. It should be noted that although we train 5-40 networks, the effective number of parameters is not more (and probably even less) than the number of parameters of a single network. This is because we do not have the flexibility to estimate an optimal combination of predictors, but rather take their simple average.

Baseline results were obtained by training 40 networks without any regularization. We then derived an average predictor whose output is the mean of the 40 nets' outputs (Figure 2 (top left)). The predictor had no smoothness constraints and therefore found relatively linear boundaries (this can also be seen in Figure 3 (top left), where a five-net ensemble average is taken).

Figure 2. Summary of 40-net ensemble results. Top left: no constraints (no weight decay or noise). Top right: optimal weight decay (λ = 3e-4) and no noise. Bottom left: optimal noise (Gaussian SD = 0.35) and zero weight decay. Bottom right: optimal noise and optimal weight decay. The classification threshold in this figure and the following ones is 0.5.

4.1.1. Effect of training with noise on a flexible predictor. We trained 30-hidden-unit networks using the bootstrap method (as described in Section 3), with noise SD ranging from ε = 0 to ε = 0.8, and K = 10N. Figure 3 demonstrates the effect of noise on the predictor. Each image is a thresholded output of a five-net ensemble-average predictor. The noise level goes from ε = 0 in the upper left image through ε = 0.8 in the lower right. The classification results are drawn on a uniform grid of 100 × 100 points (namely, a much larger test set) so as to get a clear view of the classification boundaries defined by the classifier. It can be seen that for small noise levels ε, the ensemble-average predictor is unable to find any smooth structure in the data and merely over-fits the training data. For moderate levels of noise, a better structure can be found, and for large levels of noise, the data are so corrupted that again no structure can be found. The optimal noise SD was around ε = 0.35.
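The experiment above can be approximated with off-the-shelf tools. The sketch below uses scikit-learn's MLPClassifier, whose alpha parameter is an L2 penalty on the weights playing the role of the weight-decay term λ; the optimizer, stopping rule and exact penalty scaling differ from Ripley's NNET package, so the numbers will not match the paper exactly.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def ben_ensemble_predict(X, y, grid, noise_sd=0.35, weight_decay=3e-4,
                         n_nets=5, K=None, seed=0):
    """Train n_nets 2-30-1 style networks on independent noisy bootstrap
    resamples and return the averaged class-1 probability on `grid`."""
    rng = np.random.default_rng(seed)
    N = len(X)
    K = 10 * N if K is None else K
    probs = np.zeros(len(grid))
    for _ in range(n_nets):
        idx = rng.integers(0, N, size=K)
        Xb = X[idx] + rng.normal(0.0, noise_sd, size=(K, X.shape[1]))
        net = MLPClassifier(hidden_layer_sizes=(30,), activation="logistic",
                            alpha=weight_decay, max_iter=2000,
                            random_state=int(rng.integers(1 << 31)))
        net.fit(Xb, y[idx])
        probs += net.predict_proba(grid)[:, 1]
    return probs / n_nets

# Example use: evaluate the ensemble on a 100 x 100 grid and threshold at 0.5.
# X, y = two_spirals()   # from the earlier sketch
# xx, yy = np.meshgrid(np.linspace(-7, 7, 100), np.linspace(-7, 7, 100))
# grid = np.column_stack([xx.ravel(), yy.ravel()])
# labels = (ben_ensemble_predict(X, y, grid) > 0.5).reshape(xx.shape)
```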
4.1.2. Effect of weight-decay regularization. Weight-decay regularization involves finding an optimal parameter λ that controls the amount of weight decay versus the bias of the net. We trained networks with different λ's and found that optimal values were around λ = 3e-4. When comparing the effect of averaging alone with the effect of regularization via weight decay with no averaging, it turns out that the bootstrap method (averaged over different initial network weights) has better generalization properties than the weight-decay method. The weight-decay regularization does not generalize well on the outer points, where the training data are more sparse.

4.1.3. Applying bootstrap to networks with weight decay. Our best results were obtained when applying the BEN method to networks with optimal weight-decay regularization. Figure 4 demonstrates the effect of bootstrap with noise on the performance of a five-net ensemble trained with optimal weight decay. The effect of ensemble averaging over networks that were trained only with different random initial conditions is shown in the top left image, which represents no noise during training. Optimal noise values are similar to those obtained when training with no weight decay, and are surprisingly high (see Figure 1 (right) for the corruption the noise causes in the data). Although the results look better than those with no weight decay, in the sense that the boundaries look smoother, they can still be improved by averaging over a larger ensemble of networks. This is demonstrated in the next section (Figure 2).

Figure 3. Effect of training with different levels of Gaussian noise. Ensembles of five networks with no weight decay and a varying degree of noise (top left is zero noise, bottom right is noise with SD = 0.8).

The effect of averaging is summarized in Figure 2. It can be seen that the 40-net ensemble-averaging results, with no weight decay and no noise, are better than the corresponding ones when an ensemble of five nets is used (Figure 3). Similarly, the results for an ensemble of 40 networks trained with optimal weight decay and no noise are better than the corresponding five-net ensemble (Figure 4 (top left)). Finally, the combination of weight decay, noise and a 40-net ensemble clearly gives the best results (Figure 2 (bottom right)). Thus, while earlier work suggested that a single-hidden-layer feedforward network is not capable of capturing the structure in the spiral data, it is evident that a network ensemble with strong control over its capacity (via weight decay), trained with heavy noise, can discover the highly non-linear structure of the problem.

Figure 4. Effect of training with different noise levels on five-net ensemble networks with weight decay. Noise levels are as before, 0-0.8 from top left to bottom right.

4.2. Generalized Additive Models

In this section, we take a different approach. Instead of analyzing a method that has a hard time with the spiral data, we study a model that is very natural for it. We apply bootstrapping to a generalized additive model (GAM) (Hastie & Tibshirani, 1986, 1990) with a polynomial fit of degree 1 on the same data. We had to optimize the degree of the polynomial and the span, which determines the smoothness and the degree of locality of the estimation.[3] Due to these efficient controls, this flexible model is much more appropriate for the spiral data.
Furthermore, this algorithm provides a unique model, i.e. for each set of parameters there is no variability in the produced models, as opposed to the variability generated by the random initial weights of a feedforward network. All of this suggests that there should be no reason to bootstrap with noise, since the smoothness and locality controls can already regulate the smoothness of the boundary surface, and there seems to be no reason to corrupt the data with unfamiliar noise. Moreover, there is no need to average over several models, since there is no variability due to different local minima of the resulting model. It is thus surprising that even in this extreme case, bootstrapping with noise improved the generalization results. Figure 5 depicts the results for various degrees of noise added during training. It is clear that the bootstrap improves results and, furthermore, that small values of the noise sharpen the result.

Figure 5. Model estimation using GAM with bootstrap. Ten GAM predictors are averaged using bootstrap samples with a varying degree of noise. There is no noise (and thus no averaging) in the top left result.

5. Cleveland Heart Data

In this section, we analyze the Cleveland heart data (Detrano et al., 1989), donated by Dr Robert Detrano[4] to the UCI machine-learning repository (Murphy & Aha, 1992). These data concern the diagnosis of coronary artery disease and have been used in the past by statisticians and by the machine-learning community (Brazdil & Henery, 1994; Detrano et al., 1989; Gennari et al., 1988; Stensmo, 1995). Further data and pre-processing details are given in Appendix B. The pre-processing, which included removal of missing values, sphering the data and creating dummy variables to replace categorical variables, resulted in a dramatic improvement over past results. Moreover, it revealed that in the new data representation the structure is very linear, since logistic regression was able to obtain a nine-fold cross-validation error of about 15.2%. A similar error was obtained using extensive pre-processing and temporal-difference reinforcement learning (Stensmo, 1995). Both results are consistent with our feedforward architecture results with no noise injection and are (as far as we know) the current best results on these data.[5] It is thus a very challenging problem for NNs, as the deviation from linear structure is very small,[6] and highly non-linear estimators such as CART, radial-basis functions and KNN did not do so well on these data (Brazdil & Henery, 1994). The problem is complementary to the spiral problem considered before; there, we attempted to improve performance on highly non-linear data which required a large-capacity network, while here we try to improve performance on a relatively linear problem using a small-capacity network. In both cases, we show that noise cannot be replaced by network size or weight-decay regularization and is essential for good performance.
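As a point of reference, the nine-fold cross-validated logistic-regression baseline can be sketched as follows; it assumes the pre-processed 19-variable representation described in Appendix B (a sketch of that pre-processing appears after the appendix) and uses scikit-learn rather than the original S-Plus code.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def cleveland_glm_cv_error(X, y, n_folds=9, seed=0):
    """Nine-fold cross-validated classification error (per cent) of logistic
    regression on the pre-processed Cleveland data (19 columns, 0/1 target)."""
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    glm = LogisticRegression(max_iter=1000)
    accuracy = cross_val_score(glm, X, y, cv=cv, scoring="accuracy")
    return 100.0 * (1.0 - accuracy.mean())
```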
Figure 6. Results from logistic regression and from feedforward networks with three hidden units and varying degrees of weight decay. Left: per cent classification error. Right: ROC values. All results were obtained with nine-fold cross-validation on the Cleveland heart data. In both graphs, the first boxplot from the left represents the generalized linear model results.

Figure 6 summarizes a model comparison, using nine-fold cross-validation,[7] between logistic regression and three-hidden-unit networks based on Ripley's NNET package described in Section 4.1. Training was stopped after 400 epochs or earlier, based on Ripley's conditions. The network results were obtained by training five networks on each of the nine cross-validation sets and averaging their results. Thus, each classification error is generated from 45 networks. In each of the following figures, the statistics were obtained from 12-20 similar runs differing in random initial conditions and in the choice of cross-validation sets from the data. The cross-validation code is based on the public-domain version of Tibshirani in Statlib.[8] The results are summarized by boxplots[9] (Hoaglin et al., 1983). Each boxplot is based on 500-900 single network runs. As the ratio between the two classes is different from one, classification error is not a very robust measure for model comparison, since it is based on a single classification threshold. For example, if one class represents only 10% of the data, then setting the threshold to 1 will result in a trivial classifier that produces zero regardless of the input and has only 10% error. The receiver-operating characteristic (ROC) (Goodenough et al., 1974; Hanley & McNeil, 1982) is frequently used in such model comparisons, especially for clinical data (Henderson, 1993). This measure has been used by the contributor of the data (Detrano, 1989) and in assessing neural network performance on other heart disease data (Lippmann et al., 1995). Figure 6 implies that the performance of NNs (without noise injection), as measured by error rate and ROC values, is slightly worse (not statistically significantly) than logistic regression, and cannot be improved by weight-decay regularization alone.

Figure 7 shows the effect of noise injection for various levels of weight decay for an over-capacity architecture of nine hidden units. Noise levels in all the following graphs represent the SD of the zero-mean Gaussian noise. Although noise injection produces significant improvement, the absolute values are sub-optimal since the architecture is too large. Note, however, that the ROC values for the 0.75 weight-decay net are the highest compared with logistic regression (GLM ROC = 0.903 ± 0.001, NNET ROC = 0.91 ± 0.002; t = 1.766, degrees of freedom (df) = 21, P < 0.045; Z = 1.691, P < 0.045) or with the optimal three-hidden-unit network. We have used both the t-statistic (Hogg & Craig, 1970) and the Z-statistic of the Wilcoxon test (Lehmann, 1975), which uses non-parametric ranks to test the difference in the medians and is more robust to outliers.

Figure 7. Integrated bias, variance and total error for the hepatoma data set.

Figure 8. The effect of noise injection is diminished when no weight decay is used (compare with Figure 9). An optimal architecture of three hidden units cannot produce good results without weight decay. Left: classification error. Right: ROC values.
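For completeness, a sketch of the two comparison tools used in this section: the area under the ROC curve for each model and paired t/Wilcoxon tests over matched cross-validation runs, using scikit-learn and SciPy. SciPy's Wilcoxon implementation reports the signed-rank statistic rather than the normal-approximation Z value quoted in the text.

```python
from scipy import stats
from sklearn.metrics import roc_auc_score

def compare_models(y_true, prob_glm, prob_net, err_glm_runs, err_net_runs):
    """Area under the ROC curve for each model's predicted probabilities,
    plus paired tests on matched per-run classification errors."""
    auc_glm = roc_auc_score(y_true, prob_glm)
    auc_net = roc_auc_score(y_true, prob_net)
    # Paired t-test on matched cross-validation runs.
    t_stat, p_t = stats.ttest_rel(err_glm_runs, err_net_runs)
    # Wilcoxon signed-rank test (non-parametric, more robust to outliers).
    w_stat, p_w = stats.wilcoxon(err_glm_runs, err_net_runs)
    return {"auc_glm": auc_glm, "auc_net": auc_net,
            "t": t_stat, "p_t": p_t, "wilcoxon": w_stat, "p_wilcoxon": p_w}
```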
The ROC results suggest that the classification error of this model could be improved, possibly by averaging over a larger number of networks. To see the performance of noise injection alone, we present results of noise injection with zero weight decay and the optimal architecture (Figure 8), and show that even under a low-capacity architecture, weight decay is essential to stabilize the system.

Optimal results are presented in Figure 9. With optimal weight decay and architecture, the addition of noise achieves results which are better than any other network, and better than logistic regression. The mean error of logistic regression was 15.27 ± 0.18, the mean error for the zero-noise net was 15.07 ± 0.13 and the mean error for noise with SD = 0.3 was 14.56 ± 0.22. The difference between the optimal neural network and logistic regression is statistically significant (t = 2.196, df = 26, P < 0.018; Z = 2.14, P < 0.016), and the difference from zero noise is significant as well (t = 2.045, df = 27, P < 0.025; Z = 2.029, P < 0.021). To our knowledge, these are the best results on the Cleveland heart data.

Figure 9. Results for the optimal-architecture network. Left: classification error. Right: ROC values. Noise injection is helpful and overall performance is optimal.

6. Discussion

The motivation for our approach comes from a key observation regarding the bias/variance decomposition of prediction error, namely the fact that ensemble averaging does not affect the bias portion of the error, but reduces the variance when the estimators over which averaging is done are independent. The level of noise affects the independence between the training sets, and thus the relative improvement from ensemble averaging. However, the level of noise also affects the quality of each predictor separately, increasing its variance by increasing the variability in the data. Thus, there should be an optimal level of noise (it may not correspond to the true noise) which leads to optimal ensemble performance. This performance can be further improved if the variance of the individual networks can be tempered, e.g. with weight decay.

We have demonstrated the effect of noise injection on prediction in three different cases. (i) Highly non-linear (spiral) data, using an inappropriate model (as the data are almost radially symmetric and the neural net is not). This required the use of an ensemble of high-capacity single predictors and thus made the regularization task challenging. It was shown that the excess variance of high-capacity models could only be effectively trimmed by a combination of all three components: weight decay, noise injection and ensemble averaging. (ii) Highly non-linear (spiral) data with essentially the perfect model for it (GAM with locally linear units). Even in this case, where regularization provides the perfect bias for the model, performance could be improved by the combination. (iii) A highly linear problem, where practically any network has excess capacity.
This case is representative of a family of clinical data sets in which (linear) variable selection was applied to high-dimensional data and resulted in a highly linear, low-dimensional data structure. It was thus challenging to show that the BEN algorithm is useful in this case too, and can lead to improved classification results. Performance was also evaluated with the ROC measure, as it is a standard model-comparison tool in clinical data analysis.

The theoretical analysis suggests that it is best to start with a very flexible function approximation technique (e.g. a feedforward network with a large number of hidden units) and then control its capacity and smoothness using noise and averaging. Our conclusions are not restricted to artificial neural network estimation. We show that similar conclusions can be obtained when using a highly flexible GAM (Hastie & Tibshirani, 1986).

Acknowledgements

Stimulating discussions with Leo Breiman, Brian Ripley and Chris Bishop are gratefully acknowledged.

Notes

1. An example of an identifiable model is (logistic) regression.
2. Where $\mathrm{Var}(f)$ is defined by $E_X[(f_X(x) - E_X[f_X(x)])^2]$.
3. In this case, the model amounts to a sum of locally linear functions around each of the training samples.
4. VA Medical Center, Long Beach and Cleveland Clinic Foundation.
5. A recent best result of 23.1% on non-normalized data was obtained by a company that provides classification with its own proprietary software (UDM, 1996).
6. This is a classical problem in clinical data in which variable selection was done by a linear method; the data therefore contain mostly variables with linear structure.
7. This is a standard use; see, for example, results under the STATLOG ESPRIT project (Brazdil & Henery, 1994).
8. http://www.stat.cmu.edu.
9. To read the boxplots: the white line in the middle of the box represents the median of the distribution; the grey box represents the inter-quartile range, such that the bottom of the box is the first quartile and the top is the third quartile; the dashed line and its terminating line represent plus and minus 1.5 inter-quartile distances from the median; points lying outside this range are considered outliers, and each such point is represented by a whisker.
10. Can be obtained from Murphy and Aha (1992).

References

Baum, E. & Lang, K. (1991) Constructing hidden units using examples and queries. In R. P. Lippmann, J. E. Moody & D. S. Touretzky (Eds), Advances in Neural Information Processing Systems, Vol. 3. San Mateo, CA: Morgan Kaufmann, pp. 904-910.
Bishop, C.M. (1995) Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7, 108-116.
Brazdil, P. & Henery, R. (1994) Analysis of results. In D. Michie, D. J. Spiegelhalter & C. C. Taylor (Eds), Machine Learning, Neural and Statistical Classification. New York: Ellis Horwood, pp. 175-212.
Breiman, L. (1994) Bagging Predictors. Technical Report TR-421, Department of Statistics, University of California, Berkeley, CA.
Deffuant, G. (1995) An algorithm for building regularized, piecewise linear discrimination surfaces: the perceptron membrane. Neural Computation, 7, 480-489.
Detrano, R. (1989) Accuracy curves: an alternative graphical representation of probability data. Journal of Clinical Epidemiology, 42, 983-986.
Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J., Sandhu, S., Guppy, K., Lee, S. & Froelicher, V. (1989) International application of a new probability algorithm for the diagnosis of coronary artery disease. American Journal of Cardiology, 64, 304-310.
Efron, B. & Tibshirani, R. (1993) An Introduction to the Bootstrap. New York: Chapman and Hall.
Fahlman, S.E. (1989) Fast-learning variations on back-propagation: an empirical study. In D. Touretzky, G. Hinton & T. Sejnowski (Eds), Proceedings of the 1988 Connectionist Models Summer School. San Mateo, CA: Morgan Kaufmann, pp. 38-51.
Fahlman, S.E. & Lebiere, C. (1990) The Cascade-Correlation Learning Architecture. Technical Report CMU-CS-90-100, Carnegie Mellon University, Pittsburgh, PA.
Geman, S., Bienenstock, E. & Doursat, R. (1992) Neural networks and the bias/variance dilemma. Neural Computation, 4, 1-58.
Gennari, J.H., Langley, P. & Fisher, D. (1988) Models of incremental concept formation. Artificial Intelligence, 40, 11-61.
Goodenough, D.J., Rossmann, K. & Lusted, L.B. (1974) Radiographic applications of receiver operating characteristic (ROC) curves. Radiology, 110, 89-95.
Hanley, J.A. & McNeil, B.J. (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29-36.
Hansen, L.K. & Salamon, P. (1990) Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12, 993-1001.
Hastie, T. & Tibshirani, R. (1986) Generalized additive models. Statistical Science, 1, 297-318.
Hastie, T. & Tibshirani, R. (1990) Generalized Additive Models. London: Chapman and Hall.
Henderson, A.R. (1993) Assessing test accuracy and its clinical consequences: a primer for receiver operating characteristic curve analysis. Annals of Clinical Biochemistry, 30, 521-539.
Hoaglin, D.C., Mosteller, F. & Tukey, J.W. (1983) Understanding Robust and Exploratory Data Analysis. New York: Wiley.
Hogg, R.V. & Craig, A.T. (1970) Introduction to Mathematical Statistics (3rd edn). Toronto, Canada: Macmillan.
Krogh, A. & Hertz, J.A. (1992) A simple weight decay can improve generalization. In J. E. Moody, S. J. Hanson & R. P. Lippmann (Eds), Advances in Neural Information Processing Systems, Vol. 4. San Mateo, CA: Morgan Kaufmann, pp. 950-957.
Lang, K.J. & Witbrock, M.J. (1988) Learning to tell two spirals apart. In D. S. Touretzky, J. L. Elman, T. J. Sejnowski & G. E. Hinton (Eds), Proceedings of the 1988 Connectionist Models Summer School, pp. 52-59.
Lehmann, E.L. (1975) Nonparametrics: Statistical Methods Based on Ranks. San Francisco, CA: Holden-Day.
Lippmann, R.P., Kukolich, L. & Shahian, D. (1995) Predicting the risk of complications in coronary artery bypass operations using neural networks. In G. Tesauro, D. Touretzky & T. Leen (Eds), Advances in Neural Information Processing Systems, Vol. 7. Cambridge, MA: MIT Press, pp. 1055-1062.
Murphy, P.M. & Aha, D.W. (1992) UCI Repository of Machine Learning Databases. Department of Information and Computer Science, University of California at Irvine.
Perrone, M.P. (1993) Improving Regression Estimation: Averaging Methods for Variance Reduction with Extensions to General Convex Measure Optimization. PhD thesis, Brown University, Institute for Brain and Neural Systems, Providence, RI.
Ripley, B.D. (1996) Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press.
Sietsma, J. & Dow, R.J.F. (1991) Creating artificial neural networks that generalize. Neural Networks, 4, 67-79.
Stensmo, M. (1995) Adaptive Automated Diagnosis. PhD thesis, Royal Institute of Technology, Stockholm, Sweden.
Ultragem Data Mining (UDM) (1996) Esprit Statlog Benchmarks. Technical Report, Boulder Creek, CA.
Wolpert, D.H. (1992) Stacked generalization. Neural Networks, 5, 241-259.

Appendix A: The Spiral Data

The two-dimensional spiral data[10] (Lang & Witbrock, 1988) are given by vectors $(x_i, y_i)$ defined by

$$x_i = r_i \cos(\alpha_i + k\pi/2), \qquad y_i = r_i \sin(\alpha_i + k\pi/2) \qquad (A1)$$

where

$$\alpha_i = \pi i/16, \qquad r_i = 0.5 + i/16, \qquad i = 0, \ldots, 97 \qquad (A2)$$

and $k = 1$ for one class and $k = 3$ for the other class.

Appendix B: Details and Pre-processing of the Cleveland Heart Data

The data in the UCI repository contain 13 variables out of about 70 that were in the original study. The task is to predict the existence of coronary artery disease (CAD) from the measurements. Data for 303 patients were obtained; 44% of the patients were diagnosed with CAD. The variable attributes are:

(1) Age
(2) Sex
(3) Chest pain type (4 values, converted to 3 binary variables)
(4) Resting blood pressure
(5) Serum cholesterol in mg/dl
(6) Fasting blood sugar > 120 mg/dl
(7) Resting electrocardiographic results (values 0, 1, 2)
(8) Maximum heart rate achieved
(9) Exercise-induced angina
(10) Oldpeak = ST depression induced by exercise relative to rest
(11) The slope of the peak exercise ST segment (converted to 2 binary variables)
(12) Number of major vessels (0-3) coloured by fluoroscopy (converted to 3 binary variables)
(13) Thal: 3 = normal; 6 = fixed defect; 7 = reversible defect (converted to 2 binary variables)

We added dummy variables to replace the categorical and ordinal variables 3, 11, 12 and 13, and therefore worked with 19 independent variables. The continuous variables 1, 4, 5, 8 and 10 were sphered (standardized) by setting the mean of each variable to zero and its variance to one. This step was necessary as the data contain variables that are on very different scales, such as age and blood pressure. The original data contain 76 attributes and have many missing values. The data used in most of the benchmarks have only 13 attributes and a few missing values, which we simply replaced by their unconditional expectations. The addition of dummy variables and data sphering had a dramatic effect on the classification results.
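A sketch of the Appendix B pre-processing in pandas. The column names follow the common UCI documentation of the 13 attributes and are an assumption on our part, as is the handling of missing categorical codes (which can change the final column count slightly).

```python
import pandas as pd

# Assumed column names; the numbers in comments refer to the attribute list above.
CATEGORICAL = ["cp", "slope", "ca", "thal"]                      # variables 3, 11, 12, 13
CONTINUOUS = ["age", "trestbps", "chol", "thalach", "oldpeak"]   # variables 1, 4, 5, 8, 10

def preprocess_cleveland(df):
    """Impute missing values by the column mean (unconditional expectation),
    dummy-code the categorical/ordinal variables and sphere the continuous
    ones to zero mean and unit variance. Target: CAD present (diagnosis > 0)."""
    df = df.copy()
    y = (df.pop("num") > 0).astype(int)          # 'num' is the diagnosis column (assumed name)
    df = df.apply(pd.to_numeric, errors="coerce")
    df = df.fillna(df.mean())
    df = pd.get_dummies(df, columns=CATEGORICAL, drop_first=True)
    df[CONTINUOUS] = (df[CONTINUOUS] - df[CONTINUOUS].mean()) / df[CONTINUOUS].std()
    return df, y
```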