Optimising the Widths of Radial Basis Functions

Mark Orr
mark@cns.ed.ac.uk
Centre for Cognitive Science, Edinburgh University
2, Buccleuch Street, Edinburgh EH8 9LW, Scotland, UK

Abstract

In the context of regression analysis with penalised linear models (such as RBF networks) certain model selection criteria can be differentiated to yield a re-estimation formula for the regularisation parameter such that an initial guess can be iteratively improved until a local minimum of the criterion is reached. In this paper we discuss some enhancements of this general approach including improved computational efficiency, detection of the global minimum and simultaneous optimisation of the basis function widths. The benefits of these improvements are demonstrated on a practical problem.

1 Introduction

Consider a radial basis function (RBF) network with centres at {c_j}, weights {w_j} and radial functions

$$ h_j(x) = \exp\!\left( -\,\frac{(x - c_j)^\top (x - c_j)}{r^2} \right) $$

(j = 1, ..., m) all having the same width r. The centres are fixed but the weights and width are adaptable. The response of the network to an input x is

$$ f(x) = \sum_{j=1}^{m} w_j \, h_j(x) . $$

Suppose that this network is trained on a regression data set {x_i, y_i} (i = 1, ..., p) by minimising the penalised sum-squared-error cost function

$$ C(w) = e^\top e + \lambda \, w^\top w $$

where w is the m-dimensional weight vector and e is the p-dimensional error vector, e_i = y_i - f(x_i). The second term penalises large weights and is designed to avoid overfitting should the unregularised model be too complex for the data. The size of the penalty is controlled by λ, the regularisation parameter, which, like w and r, is free to adapt to the training set. Given λ, the weight vector which minimises the cost is

$$ \hat{w} = A^{-1} H^\top y $$

where H_ij = h_j(x_i) is the design matrix and contains the responses of the m centres to the p inputs of the training set, A = H^T H + λ I_m and I_m is the m-dimensional identity matrix.

The adjustable parameters in the model are the m weights w_j, the basis function width r and the regularisation parameter λ. The fixed parameters are the centre positions c_j and their number m. Below we will assume that the inputs of the training set are used as the fixed centres, in which case m = p and c_j = x_j, but our results apply equally to other choices of fixed centres.
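As a concrete illustration of this setup (not part of the original paper), a minimal NumPy sketch might look as follows. The function names are our own, and the toy data is the sin(6πx) example used later in section 2.

```python
import numpy as np

def rbf_design_matrix(X, centres, r):
    """Gaussian RBF design matrix H with H[i, j] = h_j(x_i)."""
    # squared distances between every input and every centre
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / r ** 2)

def ridge_weights(H, y, lam):
    """Weight vector minimising e'e + lam * w'w, i.e. w = (H'H + lam*I)^-1 H'y."""
    m = H.shape[1]
    A = H.T @ H + lam * np.eye(m)
    return np.linalg.solve(A, H.T @ y)

# Usage: centres coincide with the training inputs (m = p), as assumed in the paper.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(60, 1))
y = 0.8 * np.sin(6 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(60)
H = rbf_design_matrix(X, X, r=0.2)
w_hat = ridge_weights(H, y, lam=1e-2)
```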
Various model selection criteria, such as generalised cross-validation (GCV) [2] or the marginal likelihood of the data (the "evidence") [3], can be differentiated and set equal to zero to yield a re-estimation formula for the regularisation parameter. For example, in a previous paper [5] we derived the following formula from GCV

$$ \lambda = \frac{\eta \, \hat{e}^\top \hat{e}}{(p - \gamma) \, \hat{w}^\top A^{-1} \hat{w}} \tag{1} $$

where ê = y - Hŵ, γ = m - λ tr(A^{-1}) (the effective number of parameters [4]) and η = tr(A^{-1} - λA^{-2}).

However there are problems with simply trying to iterate equation (1) to convergence. Firstly, depending on the initial guess, a non-optimal local minimum may be found and secondly, inversion of A is liable to become numerically unstable if λ gravitates towards very small values. Furthermore, the value of the width parameter r remains fixed. In the next section we describe a computationally efficient and numerically stable method of iterating (1) which is fast enough that the global minimum and the optimal value of r can be found by explicit search.
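For orientation, here is a sketch of the straightforward way of iterating equation (1), re-inverting A at every step; this is exactly the O(m³)-per-iteration approach that section 2 improves upon. The function name and the convergence test are our own.

```python
import numpy as np

def reestimate_lambda_naive(H, y, lam, n_iter=100, tol=1e-9):
    """Naive iteration of equation (1): A is re-inverted at every step."""
    p, m = H.shape
    HtH, Hty = H.T @ H, H.T @ y
    for _ in range(n_iter):
        A_inv = np.linalg.inv(HtH + lam * np.eye(m))
        w = A_inv @ Hty
        e = y - H @ w
        gamma = m - lam * np.trace(A_inv)                # effective number of parameters
        eta = np.trace(A_inv - lam * (A_inv @ A_inv))    # tr(A^-1 - lam A^-2)
        lam_new = (eta * (e @ e)) / ((p - gamma) * (w @ A_inv @ w))
        if abs(lam_new - lam) < tol * max(lam, 1e-300):
            return lam_new
        lam = lam_new
    return lam
```

As the paragraph above warns, such an iteration can settle in a non-optimal local minimum and becomes ill-conditioned for very small λ, which is what motivates the eigensystem method described next.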
2 Efficient Computation

Continually recomputing the inverse of A each time the value of λ changes requires of order m³ floating point operations per iteration and is vulnerable to numerical instability if λ becomes very small. A more efficient and stable method involves initially computing the eigenvalues {μ_i} and eigenvectors {u_i} of HH^T and {z_i}, the projections of the data onto the eigenvectors (z_i = y^T u_i). Thereafter, the four terms appearing in (1) can be computed efficiently (with cost only linear in p) by

$$ \hat{e}^\top \hat{e} = \sum_{i=1}^{p} \frac{\lambda^2 z_i^2}{(\mu_i + \lambda)^2} \tag{2} $$

$$ \hat{w}^\top A^{-1} \hat{w} = \sum_{i=1}^{p} \frac{\mu_i z_i^2}{(\mu_i + \lambda)^3} \tag{3} $$

$$ \eta = \sum_{i=1}^{p} \frac{\mu_i}{(\mu_i + \lambda)^2} \tag{4} $$

$$ p - \gamma = \sum_{i=1}^{p} \frac{\lambda}{\mu_i + \lambda} \tag{5} $$
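A sketch of how (2)-(5) might be implemented, assuming NumPy and the design matrix H from the earlier sketch; the function names are ours.

```python
import numpy as np

def eigensystem(H, y):
    """One-off O(p^3) eigendecomposition of H H^T plus projections z_i = y^T u_i."""
    mu, U = np.linalg.eigh(H @ H.T)      # eigenvalues mu_i, eigenvectors u_i (columns of U)
    z = U.T @ y                          # projections of the data onto the eigenvectors
    return mu, z

def gcv_terms(mu, z, lam):
    """The four sums (2)-(5); each costs only O(p) for a given lambda."""
    d = mu + lam
    ete   = np.sum(lam**2 * z**2 / d**2)     # (2)  e'e
    wAinw = np.sum(mu * z**2 / d**3)         # (3)  w' A^-1 w
    eta   = np.sum(mu / d**2)                # (4)  eta
    p_gam = np.sum(lam / d)                  # (5)  p - gamma
    return ete, wAinw, eta, p_gam
```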
Note that if p > m then the last p - m eigenvalues (assuming they are ordered from largest to smallest) are zero. However, as remarked earlier, if we have one centre for each training set input then p = m and the cost of calculating the eigenvalues and eigenvectors of the p x p matrix HH^T is of the same order as inverting the m x m matrix A. Therefore, unless (1) converges almost immediately, it is much more efficient to calculate the eigensystem once and then use (2-5) than to invert A on each iteration.

Once the eigensystem has been established, GCV,

$$ \mathrm{GCV} = \frac{p \, \hat{e}^\top \hat{e}}{(p - \gamma)^2} , $$

can also be cheaply calculated for any given λ using (2) and (5). Thus it is feasible to evaluate GCV for a number of trial values of λ, searching for local minima, refine those that are found by iterating (1) to convergence (using (2-5) of course) and finally select from the local minima the one with the smallest GCV. Assuming a wide and dense enough range of trial values is employed, this procedure will find the global minimum.
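A possible implementation of this search, reusing gcv_terms from the sketch above. The grid, the log spacing of the trial values (natural logarithms are assumed here) and the way candidate minima are detected are our own choices, guided by the experiment described below.

```python
import numpy as np

def gcv(mu, z, lam, p):
    ete, _, _, p_gam = gcv_terms(mu, z, lam)
    return p * ete / p_gam**2

def best_lambda(mu, z, p, log_grid=np.linspace(-16, 2, 50), n_iter=100, tol=1e-6):
    """Coarse grid search for local minima of GCV, then refine each with equation (1)."""
    lams = np.exp(log_grid)
    g = np.array([gcv(mu, z, l, p) for l in lams])
    # interior grid points lower than both neighbours are candidate local minima
    candidates = [lams[i] for i in range(1, len(lams) - 1) if g[i] < g[i-1] and g[i] < g[i+1]]
    refined = []
    for lam in candidates:
        for _ in range(n_iter):
            ete, wAinw, eta, p_gam = gcv_terms(mu, z, lam)
            lam_new = eta * ete / (p_gam * wAinw)              # equation (1)
            if abs(gcv(mu, z, lam_new, p) - gcv(mu, z, lam, p)) < tol * gcv(mu, z, lam, p):
                break
            lam = lam_new
        refined.append(lam_new)
    # the global minimum is the refined candidate with the smallest GCV
    return min(refined, key=lambda l: gcv(mu, z, l, p), default=lams[np.argmin(g)])
```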
We now demonstrate this method on a toy problem consisting of p = 60 samples taken from the function y = 0.8 sin(6πx) at random points in the range 0 < x < 1 and corrupted by Gaussian noise of standard deviation 0.1. The data was modelled by an RBF network with m = 60 centres coincident with the input points and basis functions of fixed width r = 0.2. Figure 1 shows the variation of GCV with λ for one particular realisation of this problem.

[Figure 1: Local (diamonds) and global (star) minima of GCV; log(GCV) plotted against log(λ) over the range -16 to 2.]

An array of 50 trial values of λ, evenly spaced between log λ = -16 and log λ = 2, was used to find rough positions for the local minima and equation (1) was then iterated for each one found. Convergence was assumed once changes in GCV from one iteration to the next had dipped below a threshold of 1 part in a million. In the example problem three local minima were detected (see figure 1) and the one with the lowest GCV corresponded to λ ≈ 2. Searching for the minima and refining the candidate solutions took up only 0.6% of the total computation time; the rest was accounted for by the calculation of eigenvalues and eigenvectors. Notice that if we had simply started with a single guess for λ and iterated equation (1) to find the solution, any initial guess below about 10^{-4} would have led to a sub-optimal solution.

Occasionally the value of λ re-estimated from equation (1) bounces back and forward between two values on each side of a local minimum and then either takes a long time to pass through this bistable state before finally converging or does not converge at all. To solve this problem we devised the following heuristic. Suppose the sequence of re-estimated values is λ_1, ..., λ_{k-2}, λ_{k-1}, λ_k, with λ_k being the current value. Then if

$$ |\lambda_k - \lambda_{k-1}| > |\lambda_k - \lambda_{k-2}| $$

replace λ_k by the geometric mean of λ_{k-1} and λ_{k-2} before proceeding to the next iteration.
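One way this damping heuristic might be folded into the update loop (our sketch, with our own names):

```python
import math

def damped_update(lam_hist, lam_new):
    """Apply the oscillation heuristic to a proposed re-estimate of lambda.

    lam_hist holds earlier re-estimates [..., lam_{k-2}, lam_{k-1}]; lam_new is lam_k.
    If lam_k is further from lam_{k-1} than from lam_{k-2} (a sign of bouncing between
    the two sides of a minimum), replace it by the geometric mean of the previous two.
    """
    if len(lam_hist) >= 2:
        lam_km1, lam_km2 = lam_hist[-1], lam_hist[-2]
        if abs(lam_new - lam_km1) > abs(lam_new - lam_km2):
            lam_new = math.sqrt(lam_km1 * lam_km2)
    lam_hist.append(lam_new)
    return lam_new
```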
3 Optimising the Width

When GCV is differentiated with respect to the regularisation parameter and set equal to zero, the resulting equation can be manipulated so that λ alone appears on the left hand side, enabling the equation to be used as a re-estimation formula. Unfortunately the same trick does not work with r because, after setting the derivative of GCV with respect to r to zero, the terms explicitly involving r cancel, so r cannot be isolated and a re-estimation formula is impossible. The same applies to other model selection criteria such as maximum likelihood of the data [6].
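For reference, the manipulation referred to above follows directly from the definitions in section 1 (this derivation is our reconstruction rather than the paper's own algebra). Using d(ê^T ê)/dλ = 2λ ŵ^T A^{-1} ŵ and d(p - γ)/dλ = η, both of which follow from A = H^T H + λI_m,

$$ \frac{\partial\,\mathrm{GCV}}{\partial\lambda} = \frac{p}{(p-\gamma)^3}\left[\, 2\lambda\,\hat{w}^\top A^{-1}\hat{w}\,(p-\gamma) - 2\,\hat{e}^\top\hat{e}\,\eta \,\right] = 0 \;\Longrightarrow\; \lambda = \frac{\eta\,\hat{e}^\top\hat{e}}{(p-\gamma)\,\hat{w}^\top A^{-1}\hat{w}} , $$

which is equation (1). No analogous rearrangement is available for r.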
When there is only one width parameter, as we assume here, it is feasible to tackle the problem of choosing an optimal value by experimenting with a number of trial values and selecting the one most favoured by the model selection criterion. The range of trial values used will be problem specific and could be determined by the likely maximum and minimum scales involved in the particular problem. The number of trial values between these limits will depend on the size of the problem (p) and the available computing resources, since for each trial value an eigensystem computation (with cost proportional to p³) will be necessary.
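Putting the pieces together, a sketch of the overall procedure might look as follows. It reuses rbf_design_matrix, eigensystem, best_lambda and gcv from the earlier sketches; the loop structure and names are our own.

```python
import numpy as np

def fit_rbf_width(X, y, r_grid):
    """Outer loop over trial widths: one O(p^3) eigensystem per r, then a cheap
    search over lambda; keep the (r, lambda) pair with the smallest GCV."""
    p = len(y)
    best = None
    for r in r_grid:
        H = rbf_design_matrix(X, X, r)      # centres fixed at the training inputs
        mu, z = eigensystem(H, y)
        lam = best_lambda(mu, z, p)
        score = gcv(mu, z, lam, p)
        if best is None or score < best[0]:
            best = (score, r, lam)
    return best[1], best[2]                 # optimal width and regularisation parameter

# e.g. 50 trial widths between 0.1 and 1.0 for the toy problem,
# or 10 values between 1 and 10 for the circuit data of section 4:
# r_opt, lam_opt = fit_rbf_width(X, y, np.linspace(0.1, 1.0, 50))
```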
[Figure 2: Tracking the global minimum with respect to λ as r changes; GCV at the global minimum plotted against r from 0 to 1.]

Figure 2 illustrates this using the toy problem described earlier. It shows the value of GCV at the global minimum over λ for 50 trial values of r between 0.1 and 1.0. The value of r = 0.2, which we used earlier, appears to have been a little on the small side. The optimal value is close to 0.45.

As r changes the location (λ) and height (GCV) of the local minima (see figure 1) change smoothly. Usually this means the location of the global minimum also changes smoothly with r, but there are particular values of the width where the identity of the local minimum with the smallest GCV switches, causing an abrupt change in location (but not height) of the global minimum. This explains the discontinuous changes of slope in the curve of figure 2. Local minima can also be created or destroyed as r changes, so discontinuous changes in value are also possible.

Of course, the ultimate arbiter of generalisation performance is not the value of a model selection criterion (such as GCV) on a particular realisation of the problem but the error of an independent test set averaged over multiple realisations. We perform such a test in the next section.

4 Results

For a thorough test of the method we turn to a more realistic problem stemming from Friedman's MARS paper [1] and later used to compare RBFs and MARS [5]. The problem involves the prediction of impedance Z and phase φ from the four parameters (resistance, frequency, inductance and capacitance) of an electrical circuit. Training sets of three different sizes (100, 200, 400) and with a signal-to-noise ratio of about 3:1 were replicated 100 times each. The input components were normalised to have unit variance and zero mean for each replication. The learning method, as described above, was applied using a set of 10 trial values of r between 1 and 10. Generalisation performance was estimated by scaled sum of squared errors over two independent test sets (one for Z and one for φ) of size 5000 and uncorrupted by noise. This is the same experimental set up as in the previous papers [1, 5] from which further details can be obtained.

          Z             φ
   p    NEW   OLD    NEW   OLD
  100   0.34  0.45   0.27  0.26
  200   0.19  0.26   0.18  0.20
  400   0.14  0.14   0.13  0.16

Table 1: Average generalisation errors for the new method, which optimises the width r, and an older method which does not.
Table 1 summarises the results. The left hand column gives training set size. Two sets of results, one for Z and one for φ, are given. The figures quoted are the average (over 100 replications) of the scaled sum of squared prediction errors. Apart from the method described above, which involves optimisation of r, the average errors of an older RBF algorithm, regularised forward selection (RFS), are also quoted (taken from [5]). The main differences to the method described here are that RFS uses a fixed value of r and creates a parsimonious network. The latter has a relatively small effect on generalisation performance.

RFS is clearly inferior to the new method for the Z problem and marginally worse for φ. We think the optimisation of r for each training set explains the superior performance of the new method, and the lack of such optimisation is a partial explanation for the poor performance of RFS compared to MARS [5]. The fixed value of r used for RFS was 3.5 but the average optimal values determined by the new method were 8.7 for Z and 2.8 for φ. Thus it looks as if the fixed value used for RFS was an underestimate in the case of Z (where the new algorithm considerably improved the results) but about right for φ (where the new method made less of an impact).

[Figure 3: Z as a function of L and C; surface plot of the fitted impedance over inductance L and capacitance C, each spanning roughly -2 to 2 in normalised units.]

Note that while r = 8.7 may sound rather large, especially in view of the normalised input components, such large basis function widths do not necessarily imply a lack of structure in the fitted function, as might be assumed. Figure 3 plots Z (impedance) against C (capacitance) and L (inductance) for fixed values of the other two components (resistance and frequency). This function was fitted to one of the p = 200 training sets for which the algorithm had found an optimal basis function width of r = 10. The function still exhibits considerable structure over the ranges of L and C even though they are less than half the size of r.

5 Conclusions

We have described a new computational method for re-estimating the regularisation parameter of an RBF network based on generalised cross-validation (GCV). It utilises an eigensystem related to the design matrix of the regression problem and is more efficient and more stable than methods which involve a direct matrix inverse at each iteration. We have extended the algorithm to optimise the basis function width simply by testing a number of trial values and selecting the one associated with the smallest value of GCV.

We tested the method on a practical problem involving 4 input dimensions and a few hundred training examples. Our method, which can adapt the width of the basis functions, but not their number, was found to have better prediction performance than a similar RBF network which can adapt the number of functions but is stuck with the same fixed width.

The new method, with its head-on approaches to finding the global minimum with respect to the regularisation parameter and to optimising the basis function width, does not scale up well for multiple regularisation parameters or multiple widths. Additionally, there is a limit on how many training examples and basis functions can be handled due to the computational cost of calculating the eigensystem. It is best suited to problems involving a single regularisation parameter, a single basis function width and about 1000 (or fewer) training set examples.

References

[1] J. Friedman. Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19:1-141, 1991.
[2] G. Golub, M. Heath, and G. Wahba. Generalised cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215-223, 1979.
[3] D. MacKay. Bayesian interpolation. Neural Computation, 4(3):415-447, 1992.
[4] J. Moody. The effective number of parameters: An analysis of generalisation and regularisation in nonlinear learning systems. In J. Moody, S. Hanson, and R. Lippmann, editors, Neural Information Processing Systems 4, pages 847-854. Morgan Kaufmann, San Mateo CA, 1992.
[5] M. Orr. Regularisation in the selection of radial basis function centres. Neural Computation, 7(3):606-623, 1995.
[6] M. Orr. An EM algorithm for regularised radial basis function networks. In International Conference on Neural Networks and Brain, Beijing, China, October 1998.