Calibrating Artificial Neural Networks by Global Optimization

János D. Pintér 1,2

1: Faculty of Engineering, Özyeğin University, Kusbakisi Caddesi, No. 2, 34662 Istanbul, Turkey; www.ozyegin.edu.tr; janos.pinter@ozyegin.edu.tr.
2: Pintér Consulting Services, Inc., Canada; www.pinterconsulting.com; janos.d.pinter@gmail.com.

Technical Report, Özyeğin University, Istanbul
Submitted for publication: July 2010

ABSTRACT

An artificial neural network (ANN) is a computational model − implemented as a computer program − that aims to emulate the key features and operations of biological neural networks. ANNs are used extensively to model unknown or unspecified functional relationships between the input and output of a "black box" system. In order to apply such a generic procedure to actual decision problems, a key requirement is ANN training to minimize the discrepancy between modeled and measured system output. In this work, we consider ANN training as a (potentially) multi-modal optimization problem. To address this issue, we introduce a global optimization (GO) framework and corresponding GO software. The practical viability of the GO based approach is illustrated by finding close numerical approximations of (one-dimensional, but non-trivial) functions.

KEYWORDS

Artificial neural networks; ANN model calibration by global optimization; Lipschitz Global Optimizer (LGO) solver suite; ANN implementation in Mathematica; MathOptimizer Professional (LGO for Mathematica); illustrative numerical examples.

1. Introduction

A neuron is a single nerve cell consisting of a nucleus (the cell body, the cell's life support center), incoming dendrites with endings called synapses that receive stimuli (electrical input signals) from other nerve cells, and an axon (nerve fiber) with terminals that deliver output signals to other neurons, muscles or glands. The electrical signal traveling down the axon is called a neural impulse: this signal is communicated to the synapses of other neurons. In this way, a single neuron acts as a signal processing unit; the entire neural network has a complex, massive multi-processing architecture. Neural systems are essential for an organism in order to adapt properly and quickly to its environment.

An artificial neural network (ANN) is a computational model − implemented as a computer program − that aims to emulate the essential features and operations of neural networks. ANNs are used extensively to model complex relationships between input and output data sequences, or to find hidden patterns in data sets, when the analytical (explicit model function based) treatment of the problem is tedious or currently not possible. ANNs replace these unknown functional relationships by adaptively constructed approximating functions (approximators). Such approximators are typically designed on the basis of training examples of a given task, such as recognizing handwritten words. For in-depth expositions related to the ANN paradigm consult e.g. Hertz et al. (1991), Cichocki and Unbehauen (1993), Smith (1993), Golden (1996), Ripley (1996), Mehrotra et al. (1997), Bar-Yam (1999), Steeb (2005), Cruse (2006).

The mathematical model of an ANN is based on the key features of a neuron as outlined above. We shall first consider a single artificial neuron (AN), and assume that a finite sequence of input signals s=(s1, s2, …, sT) is received by its synapses: the signals st, t=1,…,T, may be real numbers or vectors.
The AN's information processing capability will be modeled by introducing an aggregator (transfer) function a that depends on a parameter vector x=(x1, x2, …, xn). Subsequently, x will be chosen according to a suitable optimality criterion. The optimization is typically carried out in such a way that the sequence of model output signals m=(m1, m2, …, mT) generated by the function a=a(x, s) is as close as possible to the corresponding set of observations o=(o1, o2, …, oT). This generic description can be summarized symbolically as follows:

s → AN → m,   where s=(s1, s2, …, sT) and mt=a(x, st) for t=1,…,T;   (1)
minimize d(m, o),   where m=(m1, m2, …, mT), o=(o1, o2, …, oT), and d is a discrepancy measure.

A single AN modeled as above is called a perceptron. Flexible generalizations of a single perceptron based model are multiple perceptrons receiving the same input signals: such a structure is called an artificial neuron layer (ANL). To extend this structure further, multilayered networks can also be defined, in which several ANLs are connected sequentially. It is also possible to model several functional relationships simultaneously by the corresponding output sequences of a multi-response ANN.

Figure 1 illustrates a multilayered feed-forward ANN that includes an input layer with multiple inputs, an intermediate (hidden) signal processing layer, and an output layer that emits multiple responses. The hidden layer's AN units receive the input information, followed by certain decomposition, extraction, and transformation steps to generate the output information. Hence, the ANN implementations used in practice all share the features of adaptive, distributed, local, and parallel processing.

Figure 1  Schematic representation of a multilayered ANN. Source: http://en.wikipedia.org/wiki/Artificial_neural_network

Following the generic description (1), the optimal parameterization strategy is summarized as follows. Given the functional form of a and the sequence s of input values, we want to find a parameterization x such that the difference between the output set m=a(x, s) and the observation set o is minimal. The discrepancy d(m, o) is typically expressed by a suitable norm function: d(m, o)=||m−o||. Once the parameter vector x becomes known − i.e. it has been estimated as a result of using the training I/O sequences s and o − the trained ANN can be used to perform various tasks related to the assessment of further input data such as {st} for t=T+1, T+2, and so on. Some important application examples of this paradigm are signal processing, data classification, and time series forecasting; further application areas will be mentioned later on.

2. Function Approximation by ANNs: Theoretical Background and a Corresponding Optimization Model Formulation

The key theoretical feature of ANNs is their ability to approximate (unknown) continuous functions by learning from observed data, under general analytical conditions. Specifically, the approximation theorem of Cybenko (1989) states that a feed-forward ANN with a single hidden layer containing a finite number of neurons equipped with a suitable transfer function can serve as a universal approximator of any given continuous real-valued function f: R^k→R defined on [0,1]^k. The approximation is uniform, in terms of the supremum norm of functions defined on [0,1]^k. We will discuss this result below in some detail, and then introduce a corresponding optimization model.
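To fix ideas, the generic scheme (1) can be illustrated by a short Wolfram Language (Mathematica) sketch; Mathematica is also the computing environment used in Section 6. All names (aggregator, discrepancy, and so on), the two-parameter logistic aggregator, and the synthetic input/observation sequences are our own illustrative assumptions, not part of the model development above; the built-in NMinimize function merely stands in for a global scope solver here.

(* Minimal sketch of scheme (1): a single artificial neuron (perceptron).
   The aggregator below is an arbitrary two-parameter logistic form,
   chosen only to make the sketch runnable. *)
aggregator[{w_, theta_}, s_] := 1/(1 + Exp[-(w*s + theta)]);   (* a(x, s_t) *)

signals  = Range[0., 1., 0.1];                  (* input sequence s_1, ..., s_T *)
observed = 0.2 + 0.6 signals;                   (* synthetic observations o_t *)

modelOut[x_] := aggregator[x, #] & /@ signals;  (* m_t = a(x, s_t) *)
discrepancy[x_?(VectorQ[#, NumericQ] &)] :=     (* d(m, o) = ||m - o|| *)
  Norm[modelOut[x] - observed];

(* Calibrate the two parameters x = (w, theta) by minimizing d(m, o). *)
fit = NMinimize[discrepancy[{w, theta}], {w, theta}]

The LGO based software actually used in this report is introduced in Sections 5 and 6; this fragment only illustrates the input-output calibration logic of (1).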
For a somewhat more complete picture of the theoretical foundations of ANNs, let us remark that Funahashi (1989) and Hornik et al. (1989) proved results similar to that of Cybenko. Hornik et al. (1989) proved that feed-forward neural networks with a single hidden layer of suitable transfer function units can approximate any Borel measurable multivariate vector function f: R^k1→R^k2 uniformly, to arbitrary prescribed accuracy, on an arbitrary compact set C⊂R^k1. For more detailed technical discussions of these and closely related approximation results, consult also the foundations laid out by Kolmogorov (1957), which, however, do not lead directly to the results cited here, since the approximating formula depends on the function f. More recent discussions of related function approximation results are presented by Park and Sandberg (1991, 1993), Lang (2005), and Sutskever and Hinton (2008), with further topical references.

Let us consider a single input signal st from R^k, k≥1, and the corresponding scalar output ot=f(st) for all t=1,…,T: here f is an unknown continuous model function that we wish to approximate by a suitably defined ANN. We remark that in our general model the input signals can be delivered to a single (synapse of an) AN, or to several synapses of several ANs at the same time: the corresponding information will then be aggregated by the ANN. Cybenko (1989) establishes the theoretical possibility of generating a parameterized approximating function a(x, st), using a fixed type of transfer function σ, such that the approximation shown below is uniformly valid:

f(st) ~ a(x, st) := ∑ j=1,…,J αj σ( ∑ i=1,…,I yji sti + θj ).   (2)

To relate this general result directly to an underlying ANN model, in (2) j=1,…,J denotes the index of the individual signal processing units ANj in the hidden layer. For each unit ANj, σ is a given function form to be specified below; αj, yj, and θj are model parameters to be determined depending on the unknown function f. Specifically, σ is a continuous real-valued transfer function (in principle, σ could also be a function of j); ∑ i=1,…,I yji sti is the scalar product of the synaptic weight vector yj of ANj and the signal vector st=(st1, …, stI), whose component sti arrives at synapse i of ANj, for i=1,…,I; θj is called the threshold level (offset, bias) of ANj; and αj is the weight of ANj in the function approximation (2), for j=1,…,J.

Following Cybenko (1989) and Hornik et al. (1989), the transfer function σ is assumed to be monotonically non-decreasing, and it satisfies the relations

lim σ(z)=0 as z→−∞, and lim σ(z)=1 as z→∞.   (3)

Transfer functions with these properties are called sigmoidal or squashing functions. Let us remark here that other types of transfer functions can also be used in generating function approximations. Suitable sets of radial basis functions (RBF) are another well-known example (Park and Sandberg, 1991, 1993; Röpke et al., 2005). RBFs have the general form φ(z)=φ(||z−z0||): here ||.|| is a given norm function, and z0 is the (scalar or vector) center of the monotonically non-increasing function φ that satisfies the conditions φ(0)=1 and lim φ(z)=0 as ||z−z0||→∞.

We will continue the present discussion using a specific sigmoidal transfer function form. An often-used functional form is

σ(z)=1/(1+e^(−z)), z∈R.   (4)

For proper theoretical treatment and better numerical tractability, we assume and define finite box (variable range) constraints to bound the values of the parameters αj, yji and θj:

αj^min ≤ αj ≤ αj^max for j=1,…,J;
yji^min ≤ yji ≤ yji^max for i=1,…,I and j=1,…,J;
θj^min ≤ θj ≤ θj^max for j=1,…,J.   (5)

The entire set of model parameters will be denoted by x=({αj}, {yji}, {θj}), where the appropriate components are included for all i=1,…,I and j=1,…,J. The optimization problem induced for a given approximator function form is to minimize the error of the approximation (2) over the entire training sequence (st, ot), t=1,…,T, using a selected norm function. In this study, we shall use the least squares error criterion to measure the quality of the approximation: in other words, the objective function (6) shown below will be minimized under the box constraints (5).

e(x) = ∑ t=1,…,T (f(st) − a(x, st))^2   (6)

The notation e(x) refers to the fact that (6) is an error function. Let us point out here that other (more general linear or nonlinear) constraints regarding feasible parameter combinations may also be present, if dictated by the problem context. Such situations can also be handled by our model development approach, by simply adding these constraints to the optimization model formulation. An important case in point is to attempt the exclusion of identical (symmetric) solutions, in which the component functions of a are in principle exchangeable. To break this symmetry, one can assume e.g. that the components of {αj} are ordered so that

αj ≥ αj+1 for all j=1,…,J−1.   (7)

One could also add further lexicographic constraints if dictated by the problem at hand. Another possible − but not always applicable − option could be to normalize the weights {αj} so that we have

∑ j=1,…,J αj=1, αj≥0 for all j=1,…,J.   (8)

3. Postulating and Calibrating ANN Model Instances

Cybenko (1989) emphasizes that his cited result only guarantees a uniformly valid approximation of a continuous function f, as opposed to its exact representation. We should also keep in mind that the general approximation result expressed by (2) does not specify the actual number of processing units: hence, the key model parameter J needs to be estimated for each problem instance separately. Cybenko (1989) conjectures that J could become "astronomical" in many function approximation problems, especially so in higher dimensional approximations. These points should be kept in mind, to avoid making unjustifiably optimistic claims.

Once we select (test or postulate) the ANN structure to use in an actual case study, the remaining basic requirement is its application-specific parameterization with respect to a given training data set. The objective function in (6) makes this a nonlinear model fitting problem that could have a multitude of local or global optima. This multimodality issue is discussed in detail e.g. by Pintér (1996, 2003) with further references; in the context of ANNs, consult e.g. Auer et al. (1996), Bianchini and Gori (1996), Abraham (2004). For a more complete picture, let us remark that detailed expository discussions related to global optimization models, algorithms and applications are presented e.g. in the topical handbooks edited by Horst and Pardalos (1995) and Pardalos and Romeijn (2002).

Let us also observe here that, following the selection of the function form σ, the number of real-valued parameters (decision variables to optimize) is J(I+2), each with corresponding box constraints; furthermore, the number of optionally added linear constraints − considering (7) and (8) − is J. Since in the general ANN model setting the values of I and J cannot be theoretically specified a priori, in practical applications one can try several computationally affordable combinations.
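Before turning to solver issues, the calibration model (2)-(7) can be assembled in a few lines of Wolfram Language code for the 1-J-1 (single input, I=1) case that is also used in Section 6. The number of units J, the parameter bounds, the synthetic stand-in target function, and all symbol names below are illustrative assumptions of ours, not prescriptions of the model.

(* Sketch of the calibration objective (6) for a 1-J-1 network (I = 1).
   Indexed variables: a[j] ~ alpha_j, w[j] ~ y_j, t[j] ~ theta_j. *)
J = 3;                                        (* illustrative number of hidden units *)
sigma[z_] := 1/(1 + Exp[-z]);                 (* transfer function (4) *)

(* Approximator (2): a(x, s) with x = ({alpha_j}, {y_j}, {theta_j}). *)
approx[alpha_List, y_List, theta_List, s_] :=
  Sum[alpha[[j]] sigma[y[[j]] s + theta[[j]]], {j, J}];

(* Training data: T samples of an unknown f, here a synthetic stand-in. *)
T = 50;
sData = Table[(k - 1)/(T - 1.), {k, T}];      (* uniformly spaced points in [0, 1] *)
oData = Sin[2 Pi sData];                      (* plays the role of f(s_t) *)

(* Least squares error function (6), evaluated only for numeric input. *)
err[alpha_?(VectorQ[#, NumericQ] &), y_, theta_] :=
  Total[(oData - (approx[alpha, y, theta, #] & /@ sData))^2];

(* Box constraints (5) and the symmetry-breaking order (7). *)
vars  = Join[Array[a, J], Array[w, J], Array[t, J]];
boxes = And @@ ((-20 <= # <= 20 &) /@ vars);
order = And @@ Table[a[j] >= a[j + 1], {j, J - 1}];

Note that err is deliberately defined only for numeric arguments, so it behaves as the kind of "black box" objective discussed in Sections 4 and 5; it is handed to a global scope solver in the sketch that follows Section 5.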
In general, ANNs with more components could serve as better approximators, within reason and in line with the size of the training data set, which should at least satisfy T≥J(I+2). At the same time, in order to assure the success of a suitable nonlinear optimization procedure (and of the corresponding software), one should restrict I and J to manageable values. This choice will then influence the accuracy of the approximation obtained.

Summarizing the preceding discussion, first one has to postulate an ANN structure and the family of parameterized transfer functions to use. Next, one needs to provide a set of training (input-output) data. The training is then aimed at the selection of an optimally parameterized function from the chosen family of function models, to minimize the error function. The iterative training of ANNs − based on local optimization techniques − is often referred to in the topical literature as back-propagation.

Due to the possible multimodality of the error function (6), the simplest "standard" tools of unconstrained local nonlinear optimization (direct search, gradient based methods, Newton-type methods) often fail to locate the best parameter combination. This issue is well known to ANN researchers: consult e.g. Bianchini and Gori (1996), Sexton et al. (1998), Abraham (2004), Hamm et al. (2007). To overcome this difficulty, the various optimization strategies used in practice to train ANNs include experimental design, sampling by low-discrepancy sequences, theoretically exact global or local scope search approaches, as well as popular heuristics such as evolutionary optimization methods, particle swarm optimization, simulated annealing, tabu search and tunneling functions. ANN model parameterization frameworks and numerical studies are presented and discussed e.g. by Watkin (1993), Prechelt (1994), Bianchini and Gori (1996), Sexton et al. (1998), Jordanov and Brown (1999), Toh (1999), Ye and Lin (2003), Abraham (2004), Hamm et al. (2007).

Next we will discuss a theoretically exact global optimization based approach and a corresponding software implementation that will be used subsequently to solve several illustrative ANN calibration problems. Similarly to most other researchers, we will study "only" the problem of estimating the parameters of ANNs with a given (postulated) architecture. The problem of finding the most appropriate model form in full generality seems elusive, and it is certainly difficult − if not impossible − to address in actual numerical studies.

4. Global Optimization: A General Modeling Framework

To match the modeling approach and notations introduced earlier, we shall use the following symbols and assumptions:

x ∈ R^n   n-dimensional real-valued vector of decision variables;
e: R^n → R   continuous (scalar-valued) objective function;
D ⊂ R^n   non-empty set of feasible solutions, a proper subset of R^n. The feasible set D is closed and bounded, defined by
l ∈ R^n, u ∈ R^n   component-wise finite lower and upper bounds on x, and (optionally) by
g: R^n → R^m   an m-vector of continuous (more general) constraint functions.

We shall then consider the continuous global optimization (GO) model

min e(x)  subject to  x ∈ D := {x: l ≤ x ≤ u; gj(x) ≤ 0, j=1,…,m}.   (9)

Without going into details which are not necessary for the present discussion, let us remark that the terse model formulation (9) covers many special cases. Specifically, it also subsumes the ANN calibration model summarized by formulas (4) to (6), with the optionally added constraints (7) and (8).
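Spelled out in the notation of Section 2, this specialization of (9) reads as follows (a plain restatement of (2) and (5)-(8); no new assumptions are introduced here):

\[
\begin{aligned}
\min_{x=(\{\alpha_j\},\{y_{ji}\},\{\theta_j\})}\quad
 & e(x)=\sum_{t=1}^{T}\Bigl(f(s_t)-\sum_{j=1}^{J}\alpha_j\,
   \sigma\Bigl(\sum_{i=1}^{I}y_{ji}\,s_{ti}+\theta_j\Bigr)\Bigr)^{2}\\
\text{subject to}\quad
 & \alpha_j^{\min}\le\alpha_j\le\alpha_j^{\max},\quad
   y_{ji}^{\min}\le y_{ji}\le y_{ji}^{\max},\quad
   \theta_j^{\min}\le\theta_j\le\theta_j^{\max},\\
\text{and optionally}\quad
 & \alpha_j\ge\alpha_{j+1}\ (j=1,\dots,J-1),\qquad
   \sum_{j=1}^{J}\alpha_j=1,\ \ \alpha_j\ge 0 .
\end{aligned}
\]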
The GO problem statement (9) also guarantees that the set of global solutions is non-empty, and its structure supports the application of (theoretically) globally convergent deterministic and stochastic optimization algorithms. For technical details, consult e.g. Pintér (1996). Needless to say, the numerical handling of GO model instances of (9) can be a tough challenge: even small-dimensional model instances can be hard to handle, and they essentially require an appropriate global scope search strategy. These general notes apply also to many ANN model instances, as will be illustrated shortly (in Section 6). Consequently, the calibration of ANNs − as a rule − requires proper GO software: one such software product will be briefly described next.

5. The LGO Software Package for Constrained Global-Local Optimization

Let us reiterate the key "black box" characteristic of the ANN modeling framework: namely, that its error function is evaluated by a numerical procedure that − in a model development and testing exercise − may be subject to changes and modifications. Similar situations that require (also) global scope search often arise in real-world optimization: their solution requires easy-to-use, robust and efficient solver technology. Within the broad GO software category, direct (derivative-free) global search algorithm implementations have the advantage of immediate applicability to "black box" problems.

Next, we shall briefly review the key features of the Lipschitz Global Optimizer (LGO) solver suite, which has been designed to address "black box" problems directly. Developed since the late 1980s, LGO is currently available for a range of compiler platforms (C, C++, C#, Fortran 77/90/95), with seamless links to optimization modeling languages (AIMMS, AMPL, GAMS, MPL), MS Excel spreadsheets, and to the leading high-level technical computing systems Maple, Mathematica, and MATLAB. For theoretical, algorithmic and implementation details not discussed here, we refer e.g. to Pintér (1996, 1997, 2002, 2005, 2007, 2009, 2010a, 2010b), Pintér and Kampas (2005), Pintér et al. (2006).

The overall design of LGO is based on the combination of continuous nonlinear optimization strategies, with corresponding theoretical global and local convergence properties. Hence, LGO can be used for both constrained global and local optimization. Let us point out here that LGO's overall derivative-free design differs from that of many other nonlinear optimization software packages, which require explicit analytical model information to support model parsing and function type specific operations. Of course, we do not claim that one design is "always superior" to the other. A model function decomposition-parsing-bounding based approach supports the precise solution of a range of GO models, assuming sufficient runtime and computational resources. All models, however, have to be defined in terms of a given set of possible component functions. In contrast to this approach, the design of LGO allows the handling of genuine "black box" problems that remain outside the scope of the analytical global optimization approaches referred to above. For optimization model examples with embedded computational procedures, consult e.g. Pintér (1996, 2009), Pintér et al. (2006), Kampas and Pintér (2006, 2009).
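Since the MathOptimizer Professional (MOP) calling interface is not reproduced in this report, the following sketch illustrates the derivative-free, "black box" calling pattern using Mathematica's built-in NMinimize as an explicit stand-in for the MOP/LGO solver call; err, vars, boxes and order are the illustrative objects assembled in the Section 3 sketch above, and the method and iteration settings are arbitrary choices.

(* Global scope "black box" calibration call, continuing the Section 3 sketch.
   NMinimize is used here only as a stand-in for the MOP/LGO solver call. *)
solution = NMinimize[
   {err[Array[a, J], Array[w, J], Array[t, J]], boxes && order},
   vars,
   Method -> "DifferentialEvolution",
   MaxIterations -> 500];

{bestError, bestParams} = solution;   (* minimal e(x) and the fitted parameters x *)

The solution quality and runtime of such a stand-in run will of course differ from the MOP/LGO results reported in Section 6; the point is only that the error function can be passed to the solver as a purely numerical procedure, without analytical model information.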
In numerical practice (that is, in hardware resource-limited and time-limited runs), LGO's global search options generate one or several global solution estimates that are then refined by the seamlessly following local search mode(s). This way, the expected result of using LGO is a global and local search based, high-quality feasible solution that meets at least the local optimality criteria. (To guarantee theoretical local optimality, standard local smoothness conditions need to apply.) At the same time, one should keep in mind that no global − or, in fact, any other − optimization software will "always" work satisfactorily with default settings and under resource limitations related to model size, time, model function evaluation, or other preset usage limits.

Extensive numerical tests and an increasing range of widely different practical applications demonstrate that LGO and its platform-specific implementations can find high-quality numerical solutions not only when using academic GO test problems, but also in often far more complicated, sizeable GO models. Examples of real-world optimization challenges handled by various LGO implementations are environmental systems modeling and management (Pintér, 1996), laser design (Isenor et al., 2003), intensity modulated cancer therapy planning (Tervo et al., 2003), the operation of integrated oil and gas production systems (Mason et al., 2007), vehicle component design (Goossens et al., 2007), various packing problems with industrial relevance (Castillo et al., 2008), fuel processing technology development (Pantoleontos et al., 2009), and currency trading strategies (Çağlayan and Pintér, 2010).

6. Numerical Examples and Discussion

The detailed testing and comparative assessment of optimization software products is a significant research area. One needs to follow or to establish objective evaluation criteria, and to present (in principle) fully reproducible results. The software tests themselves are based on a given set of test problems and a selection of criteria such as the reliability and efficiency of the solvers used, in terms of the quality of solutions obtained and the corresponding computational effort. For examples of benchmarking studies in the ANN context we refer to Prechelt (1994) and Hamm et al. (2007); in the GO context, see e.g. Ali et al. (2005), Khompatraporn et al. (2005), Pintér (2002, 2007), with further topical references therein.

Here we will illustrate the performance of global optimization software (specifically, of an LGO implementation) in the ANN calibration context, by solving merely one-dimensional, but non-trivial, function approximation problems. The key features of the computational environment used and of the numerical tests are summarized below; the complete set of detailed numerical results is available from the author upon request.

• Hardware: PC, Intel™ Core™ 2 Duo CPU P8400 @ 2.26 GHz, 2.86 GB RAM.
• Operating system: Windows XP Pro 2002 with SP3 (32-bit).
• Global optimization software: MathOptimizer Professional (MOP). Let us remark that MathOptimizer Professional is the software product name introduced for the LGO solver implementation that is linked to Mathematica (Wolfram Research, 2009). For details, consult e.g. Pintér and Kampas (2005), Kampas and Pintér (2006).
• ANN computational models used: 1-J-1 type networks that have a single input node, a single output node, and J processing units. Recall that this setup directly follows the function approximation structure expressed by (2).
The ANN models considered are implemented in native Mathematica code.
• ANN initialization by assigned starting parameter values is not necessary, since MOP (i.e., LGO) uses a genuine global scope optimization approach. Let us note at the same time that all LGO implementations also support the optional use of set initial values in their constrained local optimization mode, without first applying a global search mode.
• A single run is done in all cases; no repetitions are needed, since MOP produces identical runs under the same initializations and solver options. Repeated runs with possibly changing numerical results can easily be generated by setting several options of MOP. The latter are based on selecting one of the several global search operational modes, using different starting points in the local search mode, using different random seeds, and assigning different computational resources to MOP runs.
• In all test problems the optimized function approximations are based on a set of training points, in line with the modeling framework of Section 2.
• Simulated noise structure: none. In other words, we assume that all training data are exact. This assumption could easily be relaxed by introducing a probabilistic noise structure, as done e.g. in Pintér (2003).
• Key performance measure: standard error (the root of the mean squared error, RMSE) calculated for the numerical solution found.
• Some additional Mathematica features used: documentation and model visualization capabilities.
• Some additional MOP features used here (or elsewhere): automatic generation of optimization model code in C or Fortran, and the automatic generation of input and result text files. These files are used and generated by the LGO solver core of MOP.

In the numerical examples presented here, our objective is to "learn" the graph of certain one-dimensional functions defined on the interval [0,1], based on a preset number of uniformly spaced training points. As earlier, we shall use the notation T for the number of training data points, and J for the number of processing units in the hidden layer.

Example 1

Our first (and easiest) test problem has also been studied by Toh (1999), as well as by Ye and Lin (2003):

f(x)=sin(2πx) cos(4πx), 0≤x≤1.   (10)

Applying T=100, J=7, and MOP's local search mode (LS), we obtain a corresponding function approximation characterized by RMSE ~ 0.00294674. With default MOP settings, the number of error function evaluations is 83,473; the corresponding runtime is 4.20 seconds, implying an estimated 20,000 function evaluations per second. (Let us remark that the speed estimate can vary a bit, based on the actual status of the computer used.) For comparison, the function (10) is displayed below (see the left hand side of Figure 2), directly followed by the list plot (right hand side) produced by the output of the trained ANN at the uniformly distributed T=100 sample points.

Figure 2  Function (10), and the list plot produced by the optimized ANN.

Both the numerical result (the RMSE obtained) and Figure 2 indicate a good fit. For a more complete picture, we also solved the same problem using LGO's most extensive global search mode (stochastic multi-start, MS), followed by corresponding local searches from the best points found. This operational mode is referred to as MSLS, and it has been used in all subsequent numerical examples.
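For readers who wish to reproduce the flavor of Example 1 in plain Mathematica, here is a hedged, self-contained sketch of the T=100, J=7 setup for the target function (10). NMinimize again stands in for MOP; with default settings this toy run is slow and will generally not reach the RMSE values reported here, so it only illustrates the calling pattern and the RMSE calculation, not the MOP/LGO results.

(* Illustrative reproduction sketch for Example 1: target (10), T = 100, J = 7.
   The decision vector c[1], ..., c[3J] packs (alpha_1..alpha_J, y_1..y_J, theta_1..theta_J). *)
J = 7; T = 100;
f[x_] := Sin[2 Pi x] Cos[4 Pi x];                   (* target function (10) *)
sData = Table[(k - 1)/(T - 1.), {k, T}];            (* T uniformly spaced training points *)
oData = f /@ sData;

sigma[z_] := 1/(1 + Exp[-z]);                       (* transfer function (4) *)
approx1[par_?(VectorQ[#, NumericQ] &), s_] :=       (* approximator a(x, s) as in (2) *)
  Sum[par[[j]] sigma[par[[J + j]] s + par[[2 J + j]]], {j, J}];
err1[par_?(VectorQ[#, NumericQ] &)] :=              (* least squares error (6) *)
  Total[(oData - (approx1[par, #] & /@ sData))^2];

pars = Array[c, 3 J];                               (* 21 decision variables *)
sol  = NMinimize[{err1[pars], And @@ ((-20 <= # <= 20 &) /@ pars)}, pars];

rmse = Sqrt[sol[[1]]/T]                             (* standard error (RMSE) of the fit *)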
In the MSLS mode, we assume (postulate) the variable ranges [lb, ub] = [-20, 20] for all ANN model parameters, and set a limit for the global search steps (error function evaluations) as gss=10^6. Based on our GO related numerical experience, this does not seem to be an exorbitant setting, since we have to optimize 21 parameters globally. (For a simple comparison, one can think of aiming at just 10% precision for each variable setting in 21 dimensions, using a uniform grid based search: this naïve strategy would require 10^21 function evaluations.) As a result of using the MSLS operational mode, we obtain RMSE ~ 0.00166644, which represents about a 43.5% reduction of the error found using only LS. The corresponding runtime is 59.77 seconds.

It is also worth pointing out that the optimized ANN parameterizations found by LS and MSLS are quite different. Several components of these solutions lie on the boundary of the preset interval range [-20, 20], indicating a possible need for extending the search region. These findings, together with our other runs for this and other test problems, indicate the potentially massive multi-modality of the objective (6) in similar or more complicated function approximation problems, as well as the possible need to search over larger parameter regions.

In this particular test example, we could perhaps have reduced the global search effort, or simply used the local search based estimate, which seems "good enough". Such experimental "streamlining" may not always work, however, and we are fully convinced that global scope search is typically needed in calibrating ANNs. Hence, we will continue to demonstrate MOP's standard global optimization capabilities, by solving all further test problems in MSLS mode. The examples presented below are chosen to be increasingly more difficult, although all are "only" approximation problems related to one-dimensional functions. Obviously, one could "fabricate" functions that become very difficult to match by ANNs.

Example 2

f(x)=sin(2πx) sin(3πx) sin(5πx), 0≤x≤1.   (11)

To handle this problem, we use T=100, J=10, and MOP's MSLS search mode. Notice that the number of processing units has been increased heuristically, reflecting our belief that this function could be more difficult to reproduce than (10). This leads to a 30-variable GO model; we keep the allocated global search effort the same (gss=10^6) as in Example 1, however. We conjecture again that the induced GO model is highly multi-extremal. The global search based numerical solution has an RMSE ~ 0.00493399; the corresponding runtime is 76.16 seconds (1,053,014 function evaluations, i.e. ~ 13,800 evaluations per second). For comparison, the function (11) is displayed below, directly followed by the list plot produced by the trained ANN; see Figure 3. Again, the reproduction of the target function seems quite satisfactory, both numerically and visually.

Figure 3  Function (11), and the list plot produced by the optimized ANN.

Example 3

f(x)=sin(5πx) sin(7πx) sin(11πx) log(1+x), 0≤x≤1.   (12)

In (12), log denotes the natural logarithm function. As above, first we use T=100, J=10, and MOP's MSLS search mode with 10^6 global search steps; furthermore, we extend the search region to [-50, 50] for each parameter. The global search based numerical solution has an RMSE ~ 0.0353158, which is noticeably not as good as the previous RMSEs; the corresponding runtime is 82.50 seconds.
For comparison, the function (12) is displayed below, directly followed by the list plot produced by the trained ANN. In the corresponding Figure 4, one can notice a more apparent discrepancy in the "low-amplitude" section of the function, estimated by simple inspection as the interval [0, 0.4], while the approximation is quite satisfactory in the remaining "higher-amplitude" interval [0.4, 1].

Figure 4  Function (12), and the list plot produced by a suboptimal ANN.

For comparison, we also ran MOP in its LS-only mode; the resulting standard error is RMSE ~ 0.0617162, and the runtime is 28.19 seconds. Figure 5, corresponding to this case, visibly shows a much cruder approximation of (12) than the list plot in Figure 4. Again, this finding illustrates the typical need for a global scope model fitting approach.

Figure 5  Function (12), and the list plot produced by the locally optimized ANN.

Example 4

In several subsequent runs, we increased the precision of the ANN approximation to function (12) by increasing one or several of the following: the number of training data points T, the number of processing units J, and the solver runtime allocated to MOP. Obviously, all of the above measures could lead to improved results. Without going into unnecessary further details, Figure 6 illustrates, as an example, that the combination T=200, J=20, gss=3,000,000, and an additionally set runtime limit of 900 seconds allows a rather faithful approximation of the entire function. This finding is also indicated by the corresponding RMSE ~ 0.0111696. Let us remark that this MOP run was automatically interrupted at the set time limit: arguably, without this limit even better approximations could be found.

Figure 6  Function (12), and the list plot produced by the globally optimized ANN.

To summarize our illustrative numerical experiments, we can say that the GO based ANN calibration approach can be used to handle non-trivial function approximation problems with reasonable computational effort (on today's personal computers). Let us recall that − in light of the theoretical results cited earlier − exact function representations are in general too much to ask for.

Our approach is directly applicable to solving at least certain types of multi-input and multi-response approximation problems. If the input sequences need to be aggregated, then this will lead to increased GO model dimensionality. However, if the input data sequences can be processed by independently operated hidden layers for each function output, then our approach can be applied iteratively (repetitively) to such multi-response problems. Hence, in the case of fully decoupled ANN models, the corresponding sub-problems can be solved independently, thereby avoiding the (expected) exponential increase of computational complexity. Notwithstanding the above, one should be careful to avoid overly optimistic blanket statements when handling general multiple-input multiple-output problems by ANNs.

7. Conclusions

Due to its "universal approximator" capability, the ANN computational structure flexibly supports the modeling of many important research problems and real world applications that can be described by continuous functions defined over compact sets.
ANN application areas include data classification and pattern recognition (Ripley, 1996), damage detection and earthquake simulation (Pei et al., 2006), function approximation (Toh, 1999; Ye and Lin, 2003), materials science (Bhadeshia, 1999), experimental design of engineering systems (Röpke et al., 2005), nonlinear optimization (Malek et al., 2010), polypeptide structure prediction (Dorn and de Souza, 2010), prediction of trading signals of stock market indices (Tilakaratne et al., 2008), regression analysis (De Veux et al., 1998), signal and image processing (Watkin, 1993; Masters, 1994), and time series analysis and forecasting (Franses and van Dijk, 2000; Kajitani et al., 2005).

At the same time, the ANN computational model itself has an obvious meta-heuristic flavor, in the sense that it needs to be instantiated and calibrated for each new model type. Even if such problem-specific ANN design can benefit from the expertise of domain specialists, manual design can become truly difficult as problem complexity increases. Furthermore, once the ANN structure is selected, the computational effort may still need adjustments to handle the application satisfactorily. These practical aspects require flexible and transparent ANN implementations. Our current research code, set up in an interpreted computing environment (Mathematica), is an illustrative example of such an implementation. After defining the function to be approximated, one needs to change only a few input data items to produce a new sampling plan, to define the ANN instance (within a postulated ANN family), and to activate the optimization engine. The corresponding results can then be directly observed and visualized (after a few seconds or minutes in the examples presented here, using a PC of average capability).

A focal point of the present study has been to demonstrate the essential requirement of high-quality ANN model calibration. To meet this challenge, we apply a global optimization (GO) based model fitting strategy. Our approach is based on theoretically sound GO methodology, and the corresponding software implementations are capable of handling non-trivial function approximation problems in reasonable time. The presented illustrative examples directly indicate the applicability of the GO based approach e.g. to pattern or symbol recognition, image or signal processing, and time series analysis.

In accordance with the theoretically exponential complexity of GO, the numerical demand of model fitting exercises can be expected to increase considerably with the size of the induced optimization model. This specifically includes ANN fitting problems in higher dimensions, and the consideration of general (possibly complicated) constraints regarding feasible parameter settings. We plan to investigate these issues in forthcoming studies, in relation to practically motivated problems that can be modeled by ANNs.

Acknowledgements

The author wishes to acknowledge the valuable contributions of Dr. Frank Kampas to the development of the MathOptimizer Professional software product. Thanks are due also to Dr. Maria Battarra for her comments on the manuscript.

References

Abraham, A. (2004) Meta learning evolutionary artificial neural networks. Neurocomputing 56, 1-38.
Ali, M.M., Khompatraporn, C. and Zabinsky, Z.B. (2005) A numerical evaluation of several stochastic algorithms on selected continuous global optimization test problems.
Journal of Global Optimization 31, 635-672.
Auer, P., Herbster, M. and Warmuth, M. (1996) Exponentially many local minima for single neurons. In: Touretzky, D. et al., Eds., Advances in Neural Information Processing Systems, Vol. 8, pp. 316-322. MIT Press, Cambridge, MA.
Bar-Yam, Y. (1999) Dynamics of Complex Systems. Westview Press, Boulder, CO.
Bhadeshia, H.K.D.H. (1999) Neural networks in materials science. ISIJ International 39 (10), 966-979.
Bianchini, M. and Gori, M. (1996) Optimal learning in artificial neural networks: A review of theoretical results. Neurocomputing 13, 313-346.
Çağlayan, M.O. and Pintér, J.D. (2010) Development and calibration of currency market strategies by global optimization. Research Report, Özyeğin University, Istanbul. (Submitted for publication)
Castillo, I., Kampas, F.J., and Pintér, J.D. (2008) Solving circle packing problems by global optimization: numerical results and industrial applications. European Journal of Operational Research 191, 786-802.
Cichocki, A. and Unbehauen, R. (1993) Neural Networks for Optimization and Signal Processing. John Wiley & Sons, Chichester, UK.
Cruse, H. (2006) Neural Networks as Cybernetic Systems (2nd and Revised Edition). Brains, Minds and Media, Bielefeld, Germany.
Cybenko, G. (1989) Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems 2, 303-314.
De Veux, R.D., Schumi, J., Schweinsberg, J., and Ungar, L.H. (1998) Prediction intervals for neural networks via nonlinear regression. Technometrics 40, 273-282.
Dorn, M. and de Souza, O.N. (2010) A3N: An artificial neural network N-gram-based method to approximate 3-D polypeptides structure prediction. Expert Systems with Applications (forthcoming), doi:10.1016/j.eswa.2010.04.096.
Franses, P.H. and van Dijk, D. (2000) Nonlinear Time Series Models in Empirical Finance. Cambridge University Press, Cambridge, UK.
Funahashi, K. (1989) On the approximate realization of continuous mappings by neural networks. Neural Networks 2, 183-192.
Golden, R.M. (1996) Mathematical Methods for Neural Network Analysis and Design. MIT Press, Cambridge, MA.
Goossens, P., McPhee, J., Schmitke, C., Pintér, J.D., and Stahl, H. (2007) Driving innovation: how mathematical modeling and optimization increase efficiency and productivity in vehicle design. Technical Memorandum, Maplesoft, Waterloo, ON.
Hamm, L., Brorsen, B.W. and Hagan, M.T. (2007) Comparison of stochastic global optimization methods to estimate neural network weights. Neural Processing Letters 26, 145-158.
Hertz, J., Krogh, A., and Palmer, R.G. (1991) Introduction to the Theory of Neural Computation. Addison-Wesley, Reading, MA.
Hornik, K., Stinchcombe, M., and White, H. (1989) Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
Horst, R. and Pardalos, P.M., Eds. (1995) Handbook of Global Optimization, Volume 1. Kluwer Academic Publishers, Dordrecht.
Isenor, G., Pintér, J.D. and Cada, M. (2003) A global optimization approach to laser design. Optimization and Engineering 4, 177-196.
Jordanov, I. and Brown, R. (1999) Neural network learning using low-discrepancy sequence. In: Foo, N., Ed., AI'99, Lecture Notes in Artificial Intelligence 1747, pp. 255-267. Springer-Verlag, Berlin Heidelberg.
Kajitani, Y., Hipel, K.W., and McLeod, A.I. (2005) Forecasting nonlinear time series with feed-forward neural networks: a case study of Canadian lynx data. Journal of Forecasting 24, 105-117.
Kampas, F.J. and Pintér, J.D.
(2006) Configuration analysis and design by using optimization tools in Mathematica. The Mathematica Journal 10 (1), 128-154.
Kampas, F.J. and Pintér, J.D. (2009) MathOptimizer: A nonlinear optimization package for Mathematica users. Research Report, Özyeğin University, Istanbul. (Submitted for publication)
Khompatraporn, Ch., Pintér, J.D. and Zabinsky, Z.B. (2005) Comparative assessment of algorithms and software for global optimization. Journal of Global Optimization 31, 613-633.
Kolmogorov, A.N. (1957) On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. Doklady Akademii Nauk SSSR 114, 953-956.
Lang, B. (2005) Monotonic multi-layer perceptron networks as universal approximators. In: Duch, W. et al., Eds., Artificial Neural Networks: Formal Models and Their Applications - ICANN 2005, pp. 31-37. Lecture Notes in Computer Science 3697, Springer-Verlag, Berlin Heidelberg.
Malek, A., Hosseinipour-Mahania, N., and Ezazipoura, S. (2010) Efficient recurrent neural network model for the solution of general nonlinear optimization problems. Optimization Methods & Software 25 (4), 489-506.
Mason, T.L., Emelle, C., van Berkel, J., Bagirov, A.M., Kampas, F. and Pintér, J.D. (2007) Integrated production system optimization using the Lipschitz Global Optimizer and the Discrete Gradient Method. Journal of Industrial and Management Optimization 3 (2), 257-277.
Masters, T. (1994) Signal and Image Processing with Neural Networks. John Wiley & Sons, New York.
Mehrotra, K., Mohan, C.K., and Ranka, S. (1997) Elements of Artificial Neural Nets. MIT Press, Cambridge, MA.
Pantoleontos, G., Basinas, P., Skodras, G., Grammelis, P., Pintér, J.D., Topis, S., and Sakellaropoulos, G.P. (2009) A global optimization study on the devolatilisation kinetics of coal, biomass and waste fuels. Fuel Processing Technology 90, 762-769.
Pardalos, P.M. and Romeijn, H.E., Eds. (2002) Handbook of Global Optimization, Volume 2. Kluwer Academic Publishers, Dordrecht.
Park, J. and Sandberg, I.W. (1991) Universal approximation using radial-basis-function networks. Neural Computation 3 (2), 246-257.
Park, J. and Sandberg, I.W. (1993) Approximation and radial-basis-function networks. Neural Computation 5 (2), 305-316.
Pei, J.-S., Mai, E., and Piyawat, K. (2006) Multilayer feedforward neural network initialization methodology for modeling nonlinear restoring forces and beyond. 4WCSCM-306, 4th World Conference on Structural Control and Monitoring, San Diego, CA, July 2006.
Pintér, J.D. (1996) Global Optimization in Action. Kluwer Academic Publishers, Dordrecht.
Pintér, J.D. (1997) LGO − A program system for continuous and Lipschitz optimization. In: Bomze, I.M., Csendes, T., Horst, R. and Pardalos, P.M., Eds., Developments in Global Optimization, pp. 183-197. Kluwer Academic Publishers, Dordrecht.
Pintér, J.D. (2002) Global optimization: software, test problems, and applications. In: Pardalos, P.M. and Romeijn, H.E., Eds., Handbook of Global Optimization, Volume 2, pp. 515-569. Kluwer Academic Publishers, Dordrecht.
Pintér, J.D. (2003) Globally optimized calibration of nonlinear models: techniques, software, and applications. Optimization Methods and Software 18 (3), 335-355.
Pintér, J.D. (2005) Nonlinear optimization in modeling environments: software implementations for compilers, spreadsheets, modeling languages, and integrated computing systems. In: Jeyakumar, V. and Rubinov, A.M., Eds., Continuous Optimization: Current Trends and Modern Applications, pp.
147-173. Springer Science + Business Media, New York.
Pintér, J.D. (2007) Nonlinear optimization with GAMS/LGO. Journal of Global Optimization 38, 79-101.
Pintér, J.D. (2009) Software development for global optimization. In: Pardalos, P.M. and Coleman, T.F., Eds., Global Optimization: Methods and Applications, pp. 183-204. Fields Institute Communications Volume 55, American Mathematical Society, Providence, RI.
Pintér, J.D. (2010a) LGO − A Model Development and Solver System for Nonlinear (Global and Local) Optimization. User's Guide. (Current version) Distributed by Pintér Consulting Services, Inc., Canada; www.pinterconsulting.com.
Pintér, J.D. (2010b) Spreadsheet-Based Modeling and Nonlinear Optimization with Excel-LGO. User's Guide. With technical contributions by B.C. Şal and F.J. Kampas. (Current version) Distributed by Pintér Consulting Services, Inc., Canada; www.pinterconsulting.com.
Pintér, J.D. and Kampas, F.J. (2005) Nonlinear optimization in Mathematica with MathOptimizer Professional. Mathematica in Education and Research 10 (2), 1-18.
Pintér, J.D., Linder, D., and Chin, P. (2006) Global Optimization Toolbox for Maple: an introduction with illustrative applications. Optimization Methods and Software 21 (4), 565-582.
Prechelt, L. (1994) PROBEN1 − A set of benchmarks and benchmarking rules for neural network training algorithms. Technical Report 21/94, Fakultät für Informatik, Universität Karlsruhe, Germany, September 1994.
Ripley, B.D. (1996) Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, UK.
Röpke, K. et al., Eds. (2005) DoE - Design of Experiments. Produced with the technical collaboration of IAV GmbH. Verlag Moderne Industrie, sv corporate media GmbH, Munich, Germany.
Sexton, R.S., Alidaee, B., Dorsey, R.E. and Johnson, J.D. (1998) Global optimization for artificial neural networks: A tabu search application. European Journal of Operational Research 106, 570-584.
Smith, M. (1993) Neural Networks for Statistical Modeling. Van Nostrand Reinhold, New York.
Steeb, W.-H. (2005) The Nonlinear Workbook (3rd Edition). World Scientific, Singapore.
Sutskever, I. and Hinton, G.E. (2008) Deep, narrow sigmoid belief networks are universal approximators. Neural Computation 20, 2629-2636.
Tervo, J., Kolmonen, P., Lyyra-Laitinen, T., Pintér, J.D., and Lahtinen, T. (2003) An optimization-based approach to the multiple static delivery technique in radiation therapy. Annals of Operations Research 119, 205-227.
Tilakaratne, Ch.D., Mammadov, M.A. and Morris, S.A. (2008) Predicting trading signals of stock market indices using neural networks. In: Wobcke, W. and Zhang, M., Eds., AI 2008: Advances in Artificial Intelligence, pp. 522-531. Lecture Notes in Computer Science 5360, Springer-Verlag, Berlin Heidelberg.
Toh, K.-A. (1999) Fast deterministic global optimization for FNN training. In: Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Tokyo, Japan, pp. V413-V418.
Watkin, T.L.H. (1993) Optimal learning with a neural network. Europhysics Letters 21 (8), 871-876.
Wikipedia (2010) Artificial neural network. http://en.wikipedia.org/wiki/Artificial_neural_network.
Wolfram Research (2009) Mathematica (Version 7.0.1). Distributed by WRI, Champaign, IL.
Ye, H. and Lin, Z. (2003) Global optimization of neural network weights using subenergy tunneling function and ripple search. Proc. IEEE ISCAS, Bangkok, Thailand, May 2003, V725-728.