We have seen in the single-layer perceptron model how the weights are updated using the error, which is defined as the deviation of the network output \( o \) from the expected output \( y \): $$ e = y - o $$
In a multi-layer network the challenge is how to use the error signal, which only appears at the output of the network, to modify weights in previous layers. Passing the error signal back to these earlier layers and adapting their weights is called backpropagation. It does not work with the hard threshold function that returns either \( 0 \) or \( 1 \), because any graded information is lost with the hard threshold function. To estimate how the error at the output layer should affect weights in previous layers, you need to compute the gradient of the error with respect to the weights; stepping against this gradient changes the weights so that the error at the output layer becomes smaller.
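The loss of graded information can be illustrated with a minimal sketch. The single-weight unit, the input value, and the step size below are assumptions chosen only for illustration: a small change of the weight leaves the thresholded output, and therefore the error, exactly where it was, so the error gives no hint about the direction in which the weight should move.

```python
def hard_threshold(x):
    """Hard limiter: returns 1 for a positive weighted input, otherwise 0."""
    return 1 if x > 0 else 0


def unit_output(w, x):
    """Output of a hypothetical unit with one weight and one input."""
    return hard_threshold(w * x)


x = 1.0      # input (illustrative value)
y = 0        # expected output
w = 0.5      # current weight
step = 0.01  # small change applied to the weight

e_before = y - unit_output(w, x)
e_after = y - unit_output(w - step, x)

# Both errors are identical: the small weight change is invisible at the output.
print(e_before, e_after)
```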
The basic mathematical assumption is that the error at each layer is a function of the weights in the network. Formally, this is expressed in the following equation:
$$ \frac {\delta E}{\delta w} = \frac{\delta}{\delta w} \Sigma (y -o)^2 $$
This equation reads: the derivative of the error with respect to the weights, which tells us how the error changes when the weights change, is the derivative of the summed squared error with respect to the weights.
Using what is called the chain rule in mathematics, this expression can be further simplified to the following:
$$ \frac {\delta E}{\delta w} = -2(y-o) \times \frac{\delta o}{\delta w} $$
To implement this equation as an algorithm, the expression \( \frac{\delta o}{\delta w} \) needs to be developed. It is the derivative of the network output function, which is the threshold function. As we have seen, the hard limiter jumps at the threshold and is flat everywhere else, so it has no usable derivative and is not suitable as a threshold function. In contrast, the S-shaped sigmoid function also works as a threshold, but it is continuous and can be differentiated. The sigmoid function is defined as \( \sigma(x) = \frac{1}{1+e^{-x}} \) and its derivative is:
$$ \frac {\delta}{\delta x} \sigma(x) = \sigma(x) \times (1 - \sigma(x)) $$
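The derivative formula can be checked numerically. The sketch below assumes the standard logistic form \( \sigma(x) = \frac{1}{1+e^{-x}} \) and compares the closed-form derivative with a central finite difference; the helper names and the step size `h` are illustrative choices, not part of the original text.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, assumed here as 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Closed-form derivative: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def numerical_derivative(f, x, h=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (-2.0, 0.0, 3.0):
    analytic = sigmoid_derivative(x)
    numeric = numerical_derivative(sigmoid, x)
    # The two values should agree up to a small numerical error.
    print(x, analytic, numeric)
```

At \( x = 0 \) the sigmoid returns \( 0.5 \), so the derivative is \( 0.5 \times 0.5 = 0.25 \), which is its largest value.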
If we replace the output term in the previous equations with the sigmoid function \( \sigma \) applied to the weighted sum of the outputs of the previous layer, we get the following expression:
$$ \frac {\delta E}{\delta w} = -2(y - o) \times \frac{\delta }{\delta w} \sigma( \Sigma w \times o_{previousLayer})$$
Substituting the derivative of the sigmoid, we obtain:
$$ \frac {\delta E}{\delta w} = -2(y - o) \times \sigma(\Sigma w \times o_{previousLayer}) \times (1-\sigma(\Sigma w \times o_{previousLayer})) \times \frac{\delta }{\delta w} \Sigma w \times o_{previousLayer}$$
The last term \(\frac{\delta }{\delta w} \Sigma w \times o_{previousLayer}\) is again a result of the chain rule and reduces to \( o_{previousLayer} \), so that the final formula is the following:
$$ \frac {\delta E}{\delta w} = -2(y - o) \times \sigma(\Sigma w \times o_{previousLayer}) \times (1-\sigma(\Sigma w \times o_{previousLayer})) \times o_{previousLayer} $$
In terms of code this equation can be further simplified: the expression \( (y-o) \) can be replaced with the error term \( e \), and the multiplication by 2 can be dropped, because we are only interested in the direction of change. The magnitude is adapted through the learning rate factor.
$$ \frac {\delta E}{\delta w} = -e \times \sigma(\Sigma w \times o_{previousLayer}) \times (1-\sigma(\Sigma w \times o_{previousLayer})) \times o_{previousLayer} $$
As the expression \( \sigma(\Sigma w \times o_{previousLayer}) \) corresponds to the output of the current layer, we can also write:
$$ \frac {\delta E}{\delta w} = -e \times o_{currentLayer} \times (1-o_{currentLayer}) \times o_{previousLayer} $$
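The final expression translates almost directly into code. The following is a minimal sketch in plain Python, not a definitive implementation: the nested-list layout `weights[j][i]` (weight from unit `i` of the previous layer to unit `j` of the current layer), the function names, the example values, and the update rule `w - learning_rate * gradient` are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, assumed here as 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_output(weights, o_previous):
    """o_current[j] = sigmoid(sum_i weights[j][i] * o_previous[i])."""
    return [sigmoid(sum(w * o for w, o in zip(row, o_previous)))
            for row in weights]


def weight_gradient(e, o_current, o_previous):
    """dE/dw[j][i] = -e[j] * o_current[j] * (1 - o_current[j]) * o_previous[i]."""
    return [[-e_j * o_j * (1.0 - o_j) * o_i for o_i in o_previous]
            for e_j, o_j in zip(e, o_current)]


def update_weights(weights, gradient, learning_rate):
    """Step against the gradient, scaled by the learning rate (assumed rule)."""
    return [[w - learning_rate * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(weights, gradient)]


# Illustrative values: two units in the previous layer, one in the current layer.
o_previous = [1.0, 0.5]
weights = [[0.2, -0.4]]
y = [1.0]

o_current = layer_output(weights, o_previous)
e = [y_j - o_j for y_j, o_j in zip(y, o_current)]
gradient = weight_gradient(e, o_current, o_previous)
new_weights = update_weights(weights, gradient, learning_rate=0.5)

# After one update the output should have moved closer to the expected value.
new_output = layer_output(new_weights, o_previous)
print(o_current, new_output)
```

Because the gradient carries a leading minus sign, subtracting it moves each weight in the direction that reduces the error; the learning rate only controls how large that step is.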
To apply this equation to every layer while propagating back to the input, we need to compute the error for every layer. This is done by multiplying the error of the next layer with the transposed weights that connect the current layer to that next layer.
$$ e_{currentLayer} = e_{nextLayer} w_{currentLayer2nextLayer}^T $$
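As a sketch of this step in plain Python, assume the weights between two layers are stored as a nested list `weights[j][i]`, holding the weight from unit `i` of the current layer to unit `j` of the next layer. This layout, the function names, and the numbers are illustrative assumptions. Multiplying by the transposed weights then means that each unit of the current layer collects the errors of all next-layer units it feeds into, weighted by the connecting weights.

```python
def transpose(matrix):
    """Swap rows and columns of a nested-list matrix."""
    return [list(column) for column in zip(*matrix)]


def backpropagate_error(e_next, weights):
    """e_current[i] = sum_j weights[j][i] * e_next[j]."""
    return [sum(w * e for w, e in zip(column, e_next))
            for column in transpose(weights)]


# Illustrative values: three units in the current layer, two in the next layer.
weights = [[0.5, -1.0, 2.0],
           [1.0, 0.0, -0.5]]
e_next = [0.2, -0.4]

e_current = backpropagate_error(e_next, weights)
print(e_current)  # one error value per unit of the current layer
```

The second unit of the current layer has a weight of 0 towards the second next-layer unit, so it only receives a share of the first error value.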