Backpropagation of a Neural Network, Explained

Backpropagation is the foundation of deep neural networks. We usually consider it a kind of ‘dark magic’ we are not able to understand. However, it should not be a black box we stay away from. In this article, I will explain backpropagation, as well as the whole neural network, step by step in the original mathematical way.


  • Overview of the architecture
  • Initialize parameters
  • Implement forward propagation
  • Compute Loss
  • Implement Backward propagation
  • Update parameters

1. The architecture

The neural network I’m going to explain is a 2-layer neural network. The first layer is Linear + Sigmoid, and the second layer is Linear + Softmax.


The architecture in the math formula

2. Initialize parameters

We take one training example with two features, as shown below.

The parameters are initialized randomly.
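A minimal sketch of this initialization in NumPy. The article’s actual layer sizes come from its diagram; the hidden and output sizes of 2, the scale factor, and the feature values below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: 2 inputs, 2 hidden units, 2 output classes.
n_input, n_hidden, n_output = 2, 2, 2

# Small random weights (biases omitted, matching the article's formulas).
W_ih = rng.standard_normal((n_hidden, n_input)) * 0.1   # input  -> hidden
W_ho = rng.standard_normal((n_output, n_hidden)) * 0.1  # hidden -> output

# One example with two features (values are placeholders).
x = np.array([[0.5], [0.3]])
```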

3. Forward Propagation

3.1 Layer1:




3.2 Layer2:




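The two layers above can be sketched in code as follows. This assumes the bias-free linear layers implied by the formulas; `W_ih` and `W_ho` are the input-to-hidden and hidden-to-output weight matrices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def forward(x, W_ih, W_ho):
    """Forward pass for the 2-layer network described above."""
    z1 = W_ih @ x        # Layer 1: linear
    a1 = sigmoid(z1)     # Layer 1: sigmoid activation
    z2 = W_ho @ a1       # Layer 2: linear
    a2 = softmax(z2)     # Layer 2: softmax output
    return z1, a1, z2, a2
```

The intermediate values `z1`, `a1`, `z2` are returned because the backward pass will need them.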
4. Compute Loss

The loss function we use here is the cross-entropy cost.

The actual output should be

Since we only have one example, m = 1, and the total loss is computed as follows:
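The cross-entropy cost can be sketched as below, where `a2` is the softmax output and `y` is the one-hot target; the sample values in the usage note are assumptions.

```python
import numpy as np

def cross_entropy(a2, y):
    """Cross-entropy cost L = -(1/m) * sum(y * log(a2)); here m = 1."""
    m = y.shape[1]
    return -np.sum(y * np.log(a2)) / m
```

For example, with a prediction of `[[0.7], [0.3]]` and target `[[1], [0]]`, the loss is `-log(0.7)`.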

5. Backward Propagation

In this section, we will go through backward propagation stage by stage.

5.1 Basic Derivatives



First, we know:


Then the derivative of Softmax is
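In standard notation, the two basic derivatives used in the backward pass are those of the sigmoid and the softmax:

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
\sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)
```

```latex
a_i = \operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_k e^{z_k}}, \qquad
\frac{\partial a_i}{\partial z_j} = a_i\,(\delta_{ij} - a_j)
```

A useful consequence: combined with the cross-entropy loss, the softmax derivative simplifies so that the gradient at the output pre-activation is simply \(a - y\).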

5.2 The backward Pass

5.2.1 Layer1-Layer2

Derivatives of the error with respect to the weights


Consider Who: we want to know how Who affects the total error, i.e., the value of

The chain rule states that:

So we have

Let’s break this down stage by stage

  • Stage1
  • Stage2
  • Stage3

Finally we apply the chain rule:

Let’s go through all the weights in Layer2
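The chain-rule stages above can be sketched in code. With softmax plus cross-entropy, the first two stages collapse to the well-known form dL/dz2 = a2 − y; the function name and argument layout below are illustrative, not from the article.

```python
import numpy as np

def layer2_gradients(a1, a2, y):
    """Gradients of the loss w.r.t. the hidden->output weights Who."""
    delta2 = a2 - y        # dL/dz2: stages 1 and 2 of the chain rule combined
    dW_ho = delta2 @ a1.T  # stage 3: since z2 = W_ho @ a1, dz2/dW_ho involves a1
    return delta2, dW_ho
```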

Update weights according to learning rate

Our training objective is to make the predicted value approximate the true value, which can be reformulated as minimizing the error by updating the weights with the help of a learning rate. Suppose the learning rate is 0.02.

We get the updated weight matrix as follows
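The update rule itself is plain gradient descent, using the article’s learning rate of 0.02:

```python
# Gradient-descent step: move each weight against its gradient.
learning_rate = 0.02

def update(W, dW, lr=learning_rate):
    return W - lr * dW
```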

That is the updated weight of Layer1-Layer2. The update of the Input-Layer weights follows the same story, which I will illustrate below.

5.2.2 Layer0(Input Layer) - Layer1


Following the same path as the previous section

  • Stage1:
  • Stage2:

Apply the chain rule:

We already have the second and third derivatives; for the first derivative, we apply the chain rule again, but in the opposite direction.

We have computed the first and second results, and the third one is merely the derivative of the linear function

Then we got

Similarly, we can get the Layer0-Layer1 derivatives with respect to the total error
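This hidden-layer step can be sketched as below: the output error `delta2` (= a2 − y, from section 5.2.1) is propagated back through Who and through the sigmoid’s derivative a1·(1 − a1). The function name is illustrative.

```python
import numpy as np

def layer1_gradients(x, a1, delta2, W_ho):
    """Gradients of the loss w.r.t. the input->hidden weights."""
    # Chain rule in reverse: distribute the output error over hidden units,
    # then multiply by the sigmoid derivative a1 * (1 - a1).
    delta1 = (W_ho.T @ delta2) * a1 * (1.0 - a1)
    dW_ih = delta1 @ x.T  # since z1 = W_ih @ x
    return delta1, dW_ih
```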

Update weights according to learning rate

Updating the weights with a learning rate of 0.02, we get the final weight matrix

5.3 Wrap up

Finally, we have updated all the weights.

6. Conclusion

  • Backpropagation is a beautifully designed architecture. Every gate in the diagram receives some input and produces some output; the gradient flowing back into each gate tells it how strongly its input should increase or decrease to reduce the total error. The communication between these “smart” gates makes complicated prediction and classification tasks possible.
  • The activation function matters. Take Sigmoid as an example: we saw its gradients “vanish” toward tiny values like 0.00XXX, which drives the rest of the backward pass toward zero because of the repeated multiplication in the chain rule. So we should always be cautious with Sigmoid; ReLU is often a better choice.
  • Looking back at the computing process, a lot can be done when we implement the neural network in code, such as caching intermediate values during forward propagation for reuse in the backward pass, and extracting common gradient-computation functions.
