Backpropagation Neural Networks
Let’s start with an overview of how these networks work and then fill in more detail later. A backpropagation network is trained by applying training inputs to the network’s input layer, propagating values through the network to the output neurons, and comparing the errors (or differences) between these propagated output values and the training data output values. These output errors are backpropagated through the network, and the magnitudes of the backpropagated errors are used to adjust the weights in the network.
The example we look at here uses the plotlib package from an earlier chapter; the source code for the example is in the file loving_snippet/backprop_neural_network.lisp.
We will use the following diagram to make this process more clear. There are four weights in this very simple network:
- W1,1 is the floating point number representing the connection strength between input_neuron1 and output_neuron1
- W2,1 connects input_neuron2 to output_neuron1
- W1,2 connects input_neuron1 to output_neuron2
- W2,2 connects input_neuron2 to output_neuron2
Before any training the weight values are all small random numbers.
Consider a training data element where the input neurons have the values [0.1, 0.9] and the desired output neuron values are [0.9, 0.1], that is, flipping the input values. If the propagated output values for the current weights are [0.85, 0.5] then the value of the first output neuron has a small error abs(0.85 - 0.9), which is 0.05. However, the propagated error of the second output neuron is high: abs(0.5 - 0.1), which is 0.4. Informally we see that the weights feeding output neuron 1 (W1,1 and W2,1) don’t need to be changed much, but the weights feeding output neuron 2 (W1,2 and W2,2) need modification (the value of W2,2 is too large).
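This by-hand calculation can be sketched in a few lines of Lisp. The following is a self-contained illustration, not code from backprop_neural_network.lisp, and the weight values in the example call are made up purely for illustration:

```lisp
;; Forward pass and per-neuron error for a single weight layer.
;; WEIGHTS is a list of rows, one row per input neuron, where the
;; j'th element of row i is the weight connecting input i to output j.
(defun sigmoid (x) (/ 1.0 (+ 1.0 (exp (- x)))))

(defun forward-and-errors (inputs targets weights)
  (let* ((outputs
           (loop for j from 0 below (length targets)
                 collect (sigmoid
                           (loop for i from 0 below (length inputs)
                                 sum (* (nth i inputs)
                                        (nth j (nth i weights)))))))
         ;; absolute error at each output neuron
         (errors (mapcar (lambda (output target)
                           (abs (- output target)))
                         outputs targets)))
    (list outputs errors)))

;; The training case from the text, with arbitrary example weights:
;; (forward-and-errors '(0.1 0.9) '(0.9 0.1) '((0.3 0.8) (0.9 0.4)))
```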
Of course, we would never try to manually train a network like this but it is important to have at least an informal understanding of how weights connect the flow of value (we will call this activation value later) between neurons.
In the neural network seen in the first figure we have four weights connecting the input and output neurons. Think of these four weights as forming a four-dimensional space where the range in each dimension is constrained to small positive and negative floating point values. At any point in this “weight space”, the numeric values of the weights define a model that maps the inputs to the outputs. The error seen at the output neurons is accumulated for each training example (applied to the input neurons). The training process amounts to finding a point in this four-dimensional space with a low error summed across the training data. We will use gradient descent: start with a random point in the four-dimensional space (i.e., an initial random set of weights) and move the point towards a local minimum that represents a set of weights forming a model that is (hopefully) “good enough” at representing the training data.
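The idea of gradient descent can be shown in one dimension before we apply it in weight space. The following sketch (not part of the example program) minimizes f(w) = (w - 3)^2 by repeatedly stepping against its gradient 2(w - 3):

```lisp
;; One-dimensional gradient descent: repeatedly step the weight w
;; in the direction opposite the gradient of f(w) = (w - 3)^2.
(defun gradient-descent-1d (start learning-rate steps)
  (let ((w start))
    (dotimes (n steps w)
      (setq w (- w (* learning-rate (* 2.0 (- w 3.0))))))))

;; (gradient-descent-1d 0.0 0.1 100) converges to a value near 3.0,
;; the minimum of f.
```

In the network the same kind of update is applied along each of the four weight dimensions at once, with the backpropagated error supplying the gradient.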
This process is simple enough but there are a few practical considerations:
- Sometimes the accumulated error at a local minimum is too large even after many training cycles and it is best to just restart the training process with new random weights.
- If we don’t have enough training data then the network may have enough memory capacity to simply memorize the training examples. This is not what we want: we want a model with just enough memory capacity (as represented by the number of weights) to form a generalized predictive model, but not so much capacity that it merely memorizes the training examples. The solution is to start with a small network (few hidden neurons) and increase the number of neurons until the training data can be learned. In general, having a lot of training data is good and it is also good to use as small a network as possible.
In practice using backpropagation networks is an iterative process of experimenting with the size of a network.
In the example program (in the file backprop_neural_network.lisp) we use the plotting library developed earlier to visualize neuron activation and connecting weight values while the network trains.
The following three screen shots, produced by running the function test3 defined at the bottom of the file backprop_neural_network.lisp, illustrate the training process: we start with random weights and get essentially random outputs, and then, as delta weights are used to adjust the weights in the network, the training examples are gradually learned:
In the first figure the initial weights are random so we get random mid-range values at the output neurons.
As we start to train the network, adjusting the weights, we start to see variation in the output neurons as a function of what the inputs are.
In the last figure the network is trained sufficiently well to map inputs [0, 0, 0, 1] to output values that are approximately [0.8, 0.2, 0.2, 0.3] which is close to the expected value [1, 0, 0, 0].
The example source file backprop_neural_network.lisp is long so we will only look at the more interesting parts here. Specifically we will not look at the code to plot neural networks using plotlib.
The activation values of individual neurons are limited to the range [0, 1]: each neuron’s value is first calculated as the sum of the activation values of the neurons in the previous layer, each multiplied by the value of its connecting weight, and this sum is then mapped into the desired range using the Sigmoid function. Here are the definitions of the Sigmoid function and the derivative of the Sigmoid function (dSigmoid):
(defun Sigmoid (x)
  (/ 1.0 (+ 1.0 (exp (- x)))))

(defun dSigmoid (x)
  (let ((temp (Sigmoid x)))
    (* temp (- 1.0 temp))))
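A quick sanity check of these functions (the definitions are repeated so the snippet stands alone): Sigmoid maps any sum of products into the range (0, 1), and dSigmoid is largest where the Sigmoid curve is steepest:

```lisp
(defun Sigmoid (x)
  (/ 1.0 (+ 1.0 (exp (- x)))))

(defun dSigmoid (x)
  (let ((temp (Sigmoid x)))
    (* temp (- 1.0 temp))))

(Sigmoid 0.0)   ; => 0.5, the midpoint of the output range
(dSigmoid 0.0)  ; => 0.25, the maximum slope, at x = 0
(Sigmoid 10.0)  ; close to 1.0: large sums saturate
```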
The function NewDeltaNetwork creates a new neural network object. This code allocates storage for the input, hidden, and output layers (I sometimes refer to neuron layers as “slabs”), and for the connection weights. Connection weights are initialized to small random values.
; (NewDeltaNetwork sizeList)
;      Args: sizeList = list of sizes of slabs. This also defines
;                       the number of slabs in the network.
;                       (e.g., '(10 5 4) ==> a 3-slab network with 10
;                        input neurons, 5 hidden neurons, and 4 output
;                        neurons).
;
;      Returned value = a list describing the network:
;        (nLayers sizeList
;         (activation-array[1] .. activation-array[nLayers])
;         (weight-array[2] .. weight-array[nLayers])
;         (sum-of-products[2] .. sum-of-products[nLayers])
;         (back-prop-error[2] .. back-prop-error[nLayers])
;         (old-delta-weights[2] .. old-delta-weights[nLayers])) ; for momentum term

                       :initial-element 0.0))
        (reverse old-dw-list)))

      ;;
      ; Initialize values for all activations:
      ;;
      (mapc
       (lambda (x)
         (let ((num (array-dimension x 0)))
           (dotimes (n num)
             (setf (aref x n) (frandom 0.01 0.1)))))
       a-list)

      ;;
      ; Initialize values for all weights:
      ;;
      (mapc
       (lambda (x)
         (let ((numI (array-dimension x 0))
               (numJ (array-dimension x 1)))
           (dotimes (j numJ)
             (dotimes (i numI)
               (setf (aref x i j) (frandom -0.5 0.5))))))
       w-list)
      (list numLayers sizeList a-list s-list w-list dw-list
            d-list old-dw-list alpha beta)))
In the following listing the function DeltaLearn processes one pass through all of the training data. DeltaLearn is called repeatedly until its return value falls below a desired error threshold. The main loop over the training examples is implemented in lines 69-187. Inside this outer loop there are two phases of training for each training example: a forward pass that propagates activation from the input neurons to the output neurons via any hidden layers (lines 87-143), and then the weight-correcting backpropagation of output errors, during which small adjustments are made to the weights (lines 148-187):
1 ;;
2 ; Utility function for training a delta rule neural network.
3 ; The first argument is the name of an output PNG plot file
4 ; and a nil value turns off plotting the network during training.
5 ; The second argument is a network definition (as returned from
6 ; NewDeltaNetwork), the third argument is a list of training
7 ; data cases (see the example test functions at the end of this
8 ; file for examples.
9 ;;
10
11 (defun DeltaLearn (plot-output-file-name
12 netList trainList)
13 (let ((nLayers (car netList))
14 (sizeList (cadr netList))
15 (activationList (caddr netList))
16 (sumOfProductsList (car (cdddr netList)))
17 (weightList (cadr (cdddr netList)))
18 (deltaWeightList (caddr (cdddr netList)))
19 (deltaList (cadddr (cdddr netList)))
20 (oldDeltaWeightList (cadddr (cdddr (cdr netList))))
21 (alpha (cadddr (cdddr (cddr netList))))
22 (beta (cadddr (cdddr (cdddr netList))))
23 (inputs nil)
24 (targetOutputs nil)
25 (iDimension nil)
26 (jDimension nil)
27 (iActivationVector nil)
28 (jActivationVector nil)
29 (n nil)
30 (weightArray nil)
31 (sumOfProductsArray nil)
32 (iDeltaVector nil)
33 (jDeltaVector nil)
34 (deltaWeightArray nil)
35 (oldDeltaWeightArray nil)
36 (sum nil)
37 (iSumOfProductsArray nil)
38 (error nil)
39 (outputError 0)
40 (delta nil)
41 (eida nil)
42 (inputNoise 0))
43
44 ;;
45 ; Zero out deltas:
46 ;;
47 (dotimes (n (- nLayers 1))
48 (let* ((dw (nth n deltaList))
49 (len1 (array-dimension dw 0)))
50 (dotimes (i len1)
51 (setf (aref dw i) 0.0))))
52
53 ;;
54 ; Zero out delta weights:
55 ;;
56 (dotimes (n (- nLayers 1))
57 (let* ((dw (nth n deltaWeightList))
58 (len1 (array-dimension dw 0))
59 (len2 (array-dimension dw 1)))
60 (dotimes (i len1)
61 (dotimes (j len2)
62 (setf (aref dw i j) 0.0)))))
63
64 (setq inputNoise *delta-default-input-noise-value*)
65
66 ;;
67 ; Main loop on training examples:
68 ;;
69 (dolist (tl trainList)
70
71 (setq inputs (car tl))
72 (setq targetOutputs (cadr tl))
73
74 (if *delta-rule-debug-flag*
75 (print (list "Current targets:" targetOutputs)))
76
77 (setq iDimension (car sizeList)) ; get the size of the input slab
78 (setq iActivationVector (car activationList)) ; input activations
79 (dotimes (i iDimension) ; copy training inputs to input slab
80 (setf
81 (aref iActivationVector i)
82 (+ (nth i inputs) (frandom (- inputNoise) inputNoise))))
83 ;;
84 ; Propagate activation through all of the slabs:
85 ;;
86 (dotimes (n-1 (- nLayers 1)) ; propagate activation from layer i to layer j
87 (setq n (+ n-1 1))
88 (setq jDimension (nth n sizeList)) ; get the size of the j'th layer
89 (setq jActivationVector (nth n activationList)) ; activation for slab j
90 (setq weightArray (nth n-1 weightList))
91 (setq sumOfProductsArray (nth n-1 sumOfProductsList))
92 (dotimes (j jDimension) ; process each neuron in slab j
93 (setq sum 0.0) ; init sum of products to zero
94 (dotimes (i iDimension) ; activation from neurons in previous slab
95 (setq
96 sum
97 (+ sum (* (aref weightArray i j) (aref iActivationVector i)))))
98 (setf (aref sumOfProductsArray j) sum) ; save sum of products
99 (setf (aref jActivationVector j) (Sigmoid sum)))
100 (setq iDimension jDimension) ; reset index for next slab pair
101 (setq iActivationVector jActivationVector))
102 ;;
103 ; Activation is spread through the network and sum of products
104 ; calculated. Now modify the weights in the network using back
105 ; error propagation. Start by calculating the error signal for
106 ; each neuron in the output layer:
107 ;;
108 (setq jDimension (nth (- nLayers 1) sizeList)) ; size of last layer
109 (setq jActivationVector (nth (- nLayers 1) activationList))
110 (setq jDeltaVector (nth (- nLayers 2) deltaList))
111 (setq sumOfProductsArray (nth (- nLayers 2) sumOfProductsList))
112 (setq outputError 0)
113 (dotimes (j jDimension)
114 (setq delta (- (nth j targetOutputs) (aref jActivationVector j)))
115 (setq outputError (+ outputError (abs delta)))
116 (setf
117 (aref jDeltaVector j)
118 (+
119 (aref jDeltaVector j)
120 (* delta (dSigmoid (aref sumOfProductsArray j))))))
121 ;;
122 ; Now calculate the backpropagated error signal for all hidden slabs:
123 ;;
124 (dotimes (nn (- nLayers 2))
125 (setq n (- nLayers 3 nn))
126 (setq iDimension (nth (+ n 1) sizeList))
127 (setq iSumOfProductsArray (nth n sumOfProductsList))
128 (setq iDeltaVector (nth n deltaList))
129 (dotimes (i iDimension)
130 (setf (aref iDeltaVector i) 0.0))
131 (setq weightArray (nth (+ n 1) weightList))
132 (dotimes (i iDimension)
133 (setq error 0.0)
134 (dotimes (j jDimension)
135 (setq error
136 (+ error (* (aref jDeltaVector j) (aref weightArray i j)))))
137 (setf
138 (aref iDeltaVector i)
139 (+
140 (aref iDeltaVector i)
141 (* error (dSigmoid (aref iSumOfProductsArray i))))))
142 (setq jDimension iDimension)
143 (setq jDeltaVector iDeltaVector))
144
145 ;;
146 ; Update all delta weights in the network:
147 ;;
148 (setq iDimension (car sizeList))
149 (dotimes (n (- nLayers 1))
150 (setq iActivationVector (nth n activationList))
151 (setq jDimension (nth (+ n 1) sizeList))
152 (setq jDeltaVector (nth n deltaList))
153 (setq deltaWeightArray (nth n deltaWeightList))
154 (setq weightArray (nth n weightList))
155 (setq eida (nth n eidaList))
156
157 (dotimes (j jDimension)
158 (dotimes (i iDimension)
159 (setq delta (* eida (aref jDeltaVector j) (aref iActivationVector i)))
160 (setf
161 (aref DeltaWeightArray i j)
162 (+ (aref DeltaWeightArray i j) delta)))) ; delta weight changes
163
164 (setq iDimension jDimension))
165
166 ;;
167 ; Update all weights in the network:
168 ;;
169 (setq iDimension (car sizeList))
170 (dotimes (n (- nLayers 1))
171 (setq iActivationVector (nth n activationList))
172 (setq jDimension (nth (+ n 1) sizeList))
173 (setq jDeltaVector (nth n deltaList))
174 (setq deltaWeightArray (nth n deltaWeightList))
175 (setq oldDeltaWeightArray (nth n oldDeltaWeightList))
176 (setq weightArray (nth n weightList))
177 (dotimes (j jDimension)
178 (dotimes (i iDimension)
179 (setf
180 (aref weightArray i j)
181 (+ (aref weightArray i j)
182 (* alpha (aref deltaWeightArray i j))
183 (* beta (aref oldDeltaWeightArray i j))))
184 (setf (aref oldDeltaWeightArray i j) ; save current delta weights
185 (aref deltaWeightArray i j)))) ; ...for next momentum term.
186 (setq iDimension jDimension))
187
188 (if plot-output-file-name
189 (DeltaPlot netList plot-output-file-name)))
190
191 (/ outputError jDimension)))
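The essential weight update buried inside DeltaLearn can be reduced to a single weight layer so that it is easy to read and run in isolation. The following is a simplified sketch, not the chapter’s code: there are no hidden layers, no input noise, and no momentum term. The function train-step performs one forward pass followed by one delta rule weight update:

```lisp
(defun sigmoid (x) (/ 1.0 (+ 1.0 (exp (- x)))))
(defun dsigmoid (x) (let ((s (sigmoid x))) (* s (- 1.0 s))))

(defun train-step (inputs targets weights eta)
  "One forward pass plus one delta-rule weight update.
WEIGHTS is a 2D array indexed (input, output); it is modified in place.
Returns the summed absolute output error measured before the update."
  (let ((n-in (length inputs))
        (n-out (length targets))
        (total-error 0.0))
    (dotimes (j n-out total-error)
      (let ((sum 0.0))
        ;; forward pass: sum of products into output neuron j
        (dotimes (i n-in)
          (incf sum (* (aref weights i j) (nth i inputs))))
        (let* ((output (sigmoid sum))
               ;; error signal scaled by the slope of the Sigmoid
               (delta (* (- (nth j targets) output) (dsigmoid sum))))
          (incf total-error (abs (- (nth j targets) output)))
          ;; weight change = learning rate * error signal * input activation
          (dotimes (i n-in)
            (incf (aref weights i j)
                  (* eta delta (nth i inputs)))))))))
```

Calling train-step repeatedly on the [0.1, 0.9] to [0.9, 0.1] example from the start of the chapter drives the returned error toward zero.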
The function DeltaRecall in the next listing can be used with a trained network to calculate outputs for new input values:
;;
;  Utility for using a trained neural network in the recall mode.
;  The first argument to this function is a network definition (as
;  returned from NewDeltaNetwork) and the second argument is a list
;  of input neuron activation values to drive through the network.
;  The output is a list of the calculated activation energy for
;  each output neuron.
;;
(defun DeltaRecall (netList inputs)
  (let ((nLayers (car netList))
        (sizeList (cadr netList))
        (activationList (caddr netList))
        (weightList (cadr (cdddr netList)))
        (iDimension nil)
        (jDimension nil)
        (iActivationVector nil)
        (jActivationVector nil)
        (n nil)
        (weightArray nil)
        (returnList nil)
        (sum nil))
    (setq iDimension (car sizeList)) ; get the size of the input slab
    (setq iActivationVector (car activationList)) ; get input activations
    (dotimes (i iDimension) ; copy training inputs to input slab
      (setf (aref iActivationVector i) (nth i inputs)))
    (dotimes (n-1 (- nLayers 1)) ; propagate activation from layer i to layer j
      (setq n (+ n-1 1))
      (setq jDimension (nth n sizeList)) ; get the size of the j'th layer
      (setq jActivationVector (nth n activationList)) ; activation for slab j
      (setq weightArray (nth n-1 weightList))
      (dotimes (j jDimension) ; process each neuron in slab j
        (setq sum 0.0) ; init sum of products to zero
        (dotimes (i iDimension) ; get activation from each neuron in last slab
          (setq
           sum
           (+ sum (* (aref weightArray i j) (aref iActivationVector i)))))
        (if *delta-rule-debug-flag*
            (print (list "sum=" sum)))
        (setf (aref jActivationVector j) (Sigmoid sum)))
      (setq iDimension jDimension) ; get ready for next slab pair
      (setq iActivationVector jActivationVector))
    (dotimes (j jDimension)
      (setq returnList (append returnList (list (aref jActivationVector j)))))
    returnList))
We saw three output plots earlier that were produced during a training run using the following code:
(defun test3 (&optional (restart 'yes) &aux RMSerror) ; three layer network
  (if
   (equal restart 'yes)
   (setq temp (newdeltanetwork '(5 4 5))))
  (dotimes (ii 3000)
    (let ((file-name
           (if (equal (mod ii 400) 0)
               (concatenate 'string "output_plot_" (format nil "~12,'0d" ii) ".png")
               nil)))
      (setq
       RMSerror
       (deltalearn
        file-name temp
        '(((1 0 0 0 0) (0 1 0 0 0))
          ((0 1 0 0 0) (0 0 1 0 0))
          ((0 0 1 0 0) (0 0 0 1 0))
          ((0 0 0 1 0) (0 0 0 0 1))
          ((0 0 0 0 1) (1 0 0 0 0)))))
      (if (equal (mod ii 50) 0) ;; print error out every 50 cycles
          (progn
            (princ "....training cycle #")
            (princ ii)
            (princ " RMS error = ")
            (princ RMSerror)
            (terpri))))))
Here the function test3 defines training data for a very small test network and a moderately difficult function to learn: rotating the values of the input neurons one position to the right, wrapping around to the first neuron. The main loop calls the training function 3000 times, creating a plot of the network every 400 passes through the loop.
Backpropagation networks have been used successfully in production for about 25 years. In the next chapter we will look at a less practical type of network, Hopfield networks, which are still interesting because in some sense Hopfield networks model how our brains work. In the final chapter we will look at deep learning neural networks.