Backpropagation is the foundation of deep neural networks. We often treat it as a kind of ‘dark magic’ that we cannot understand, but it should not be a black box we shy away from. In this article, I will explain backpropagation, along with the whole neural network, step by step in the original mathematical way.
- Overview of the architecture
- Initialize parameters
- Implement forward propagation
- Compute loss
- Implement backward propagation
- Update parameters
The neural network I’m going to explain is a 2-layer network: the first layer is Linear + Sigmoid, and the second layer is Linear + Softmax.
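Before going into the math, here is a minimal NumPy sketch of this architecture; the function and variable names (`sigmoid`, `softmax`, `W1`, `b1`, `W2`, `b2`) are my own, not from the original figures:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()

def forward(x, W1, b1, W2, b2):
    # Layer 1: Linear + Sigmoid
    h = sigmoid(W1 @ x + b1)
    # Layer 2: Linear + Softmax
    y_hat = softmax(W2 @ h + b2)
    return h, y_hat
```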
The architecture in mathematical form
We take one training example with two features, as shown below.
The parameters are initialized randomly.
The loss function we use here is the cross-entropy cost.
The actual output should be
Since we only have one example, m = 1, and the total loss is computed as follows:
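As a quick sketch, the cross-entropy cost for a single example (m = 1) can be computed like this; the numeric values are hypothetical stand-ins, not the ones from the worked example:

```python
import numpy as np

y_hat = np.array([0.7, 0.3])  # hypothetical prediction from the forward pass
y     = np.array([1.0, 0.0])  # hypothetical one-hot target

# Cross-entropy cost with m = 1: L = -sum_i y_i * log(y_hat_i)
loss = -np.sum(y * np.log(y_hat))
print(loss)  # ~0.357
```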
In this section, we will go through backward propagation stage by stage.
First, we know:
Then the derivative of Softmax is
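Writing the prediction as ŷ = softmax(z) in my own notation, the standard result is:

```latex
\frac{\partial \hat{y}_i}{\partial z_j} = \hat{y}_i \,(\delta_{ij} - \hat{y}_j)
```

and, combined with the cross-entropy loss L = −Σᵢ yᵢ log ŷᵢ, the gradient at the softmax input collapses to the well-known form:

```latex
\frac{\partial L}{\partial z_j} = \hat{y}_j - y_j
```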
Consider Who: we want to know how Who affects the total error, i.e., the value of
The chain rule states that:
So we have
Let’s break this down stage by stage.
Finally we apply the chain rule:
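In code, assuming the Layer2 weights Who are stored as a matrix `W2` with `z2 = W2 @ h + b2`, the chain rule above becomes a single outer product (all values hypothetical):

```python
import numpy as np

h     = np.array([0.6, 0.4])  # hypothetical hidden activations
y_hat = np.array([0.7, 0.3])  # hypothetical softmax output
y     = np.array([1.0, 0.0])  # one-hot target

dz2 = y_hat - y         # dL/dz2, the softmax + cross-entropy shortcut
dW2 = np.outer(dz2, h)  # dL/dW2: each weight's gradient is dz2_i * h_j
db2 = dz2               # dL/db2
```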
Let’s go through all the weights in Layer2
Our training goal is to make the predicted value approximate the correct value; this translates into minimizing the error by updating the weights, scaled by a learning rate. Suppose the learning rate is 0.02.
We get the updated weight matrix as follows
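As a sketch of the update rule with the learning rate of 0.02 (the weight and gradient values below are hypothetical):

```python
import numpy as np

W2  = np.array([[0.5, 0.1], [0.2, 0.3]])        # hypothetical current weights
dW2 = np.array([[-0.18, -0.12], [0.18, 0.12]])  # hypothetical gradients dL/dW2

learning_rate = 0.02
W2 = W2 - learning_rate * dW2  # W_new = W_old - lr * dL/dW
```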
That is the updated weight matrix for Layer1-Layer2. Updating the input-layer weights is the same story, which I will illustrate below.
Following the path from the previous section:
Apply the chain rule:
We have already obtained the second and third derivatives; for the first derivative, we apply the chain rule again, but in the opposite direction.
We have computed the first and second results, and the third one is merely the derivative of the linear function.
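Putting these pieces together in code, and assuming the same `W2` and `dz2` names as in the earlier sketches (all values hypothetical):

```python
import numpy as np

x   = np.array([0.05, 0.10])              # hypothetical input features
h   = np.array([0.6, 0.4])                # hidden activations from the forward pass
W2  = np.array([[0.5, 0.1], [0.2, 0.3]])  # hypothetical Layer2 weights
dz2 = np.array([-0.3, 0.3])               # dL/dz2 = y_hat - y, from above

dh  = W2.T @ dz2          # dL/dh: chain rule back through the linear layer
dz1 = dh * h * (1.0 - h)  # sigmoid derivative: sigmoid'(z1) = h * (1 - h)
dW1 = np.outer(dz1, x)    # dL/dW1: the linear layer contributes the inputs
db1 = dz1
```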
Then we get
Similarly, we can get the Layer0-Layer1 derivatives with respect to the total error.
Updating the weights with learning rate 0.02, we get the final weight matrix.
Finally, all the weights are updated.
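To tie the whole walkthrough together, here is a minimal end-to-end sketch of the 2-layer network trained on a single example; the input, target, and initial weights are hypothetical stand-ins for the article’s figures:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

x = np.array([0.05, 0.10])  # hypothetical example with two features
y = np.array([1.0, 0.0])    # hypothetical one-hot target

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)  # random initialization
W2, b2 = rng.normal(size=(2, 2)), np.zeros(2)

learning_rate = 0.02
for step in range(100):
    # Forward pass: cache h and y_hat for reuse in the backward pass
    h = sigmoid(W1 @ x + b1)
    y_hat = softmax(W2 @ h + b2)
    loss = -np.sum(y * np.log(y_hat))

    # Backward pass, exactly as derived above
    dz2 = y_hat - y
    dW2, db2 = np.outer(dz2, h), dz2
    dz1 = (W2.T @ dz2) * h * (1.0 - h)
    dW1, db1 = np.outer(dz1, x), dz1

    # Gradient descent update
    W2 -= learning_rate * dW2; b2 -= learning_rate * db2
    W1 -= learning_rate * dW1; b1 -= learning_rate * db1

print(loss)  # the loss decreases as the weights are updated
```

Note how `h` and `y_hat` computed in the forward pass are reused in the backward pass; this is the caching mentioned in the takeaways below.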
- Backpropagation is a beautifully designed architecture. Every gate in the diagram receives some inputs and produces an output, and the gradient flowing back to a gate indicates how strongly the network wants that gate’s output to increase or decrease. The communication between these “smart” gates makes complicated prediction and classification tasks possible.
- The activation function matters. Take Sigmoid as an example: we saw the gradients at its gates “vanish” towards 0.00XXX, which drives the rest of the backward pass almost to zero because of the multiplications in the chain rule. So we should be cautious with Sigmoid; ReLU is often a better choice (see the short demonstration after this list).
- If we look back at the computing process, a lot can be done when we implement the neural network in code, such as caching intermediate values during forward propagation for reuse in the backward pass, and extracting common gradient-computation functions.
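Here is a short demonstration of that vanishing effect; the derivative of the sigmoid never exceeds 0.25, and chaining such factors shrinks the gradient quickly:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 5)
grad = sigmoid(z) * (1 - sigmoid(z))  # sigmoid'(z), at most 0.25 at z = 0
print(grad)        # ~4.5e-05 at |z| = 10: already nearly zero

print(0.25 ** 10)  # ~9.5e-07: ten chained sigmoid factors, even in the best case
```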
- Reference: http://cs231n.github.io/optimization-2/