Neural Network Backpropagation Derivation

I have spent a few days hand-rolling neural networks such as CNNs and RNNs. This post collects my notes on deriving backpropagation for neural networks. The derivation of backpropagation is one of the most complicated in machine learning. There are many resources that explain how to compute gradients with backpropagation, but in my opinion most of them lack a simple example that demonstrates the problem and walks through the algorithm.

1. A Simple Neural Network

The following diagram shows the structure of the simple neural network used in this post. The input dimension (feature dimension) is 2, the hidden layer size is 3, and the output dimension is 1. Calculating the prediction is a straightforward feed-forward pass through the network. In the diagram, instead of real numbers, I use boxes to illustrate how the dimensions change from layer to layer.
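To make the dimension changes concrete, here is a minimal sketch of the feed-forward pass in plain Python. The names X, W1, W2, and forward, the batch of 4 examples, the random initial weights, and the sigmoid activation are all assumptions made for illustration; only the layer sizes (2, 3, 1) come from the description above.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def sigmoid(m):
    """Apply the sigmoid function to every entry (assumed activation)."""
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in m]


def shape(m):
    """Return (rows, columns) of a matrix stored as a list of rows."""
    return (len(m), len(m[0]))


random.seed(0)

# Hypothetical inputs: 4 examples with 2 features each, so X is 4 x 2.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

# Hypothetical weights: W1 maps 2 inputs to 3 hidden units, W2 maps 3 hidden units to 1 output.
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # 2 x 3
W2 = [[random.uniform(-1, 1) for _ in range(1)] for _ in range(3)]  # 3 x 1


def forward(X, W1, W2):
    hidden = sigmoid(matmul(X, W1))      # (n x 2) times (2 x 3) gives (n x 3)
    y_hat = sigmoid(matmul(hidden, W2))  # (n x 3) times (3 x 1) gives (n x 1)
    return hidden, y_hat


hidden, y_hat = forward(X, W1, W2)
print(shape(X), shape(hidden), shape(y_hat))
```

Each matrix product changes only the second dimension, which is the transformation the boxes in the diagram are meant to convey.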

2. Cost Function

A cost function measures the distance between the ground truth and the predicted values. A simple cost function is the sum of squared errors:
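As a rough sketch, the sum of squared errors can be computed as follows. The function name sum_squared_error and the sample values are hypothetical. Many derivations also multiply the sum by 1/2 so that the factor of 2 cancels when differentiating; that factor is left out here because the text does not confirm which form is used.

```python
def sum_squared_error(y_true, y_pred):
    """Sum of squared differences between ground truth and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))


# Hypothetical ground truth and predictions for 4 examples.
y_true = [0.0, 1.0, 1.0, 0.0]
y_pred = [0.5, 0.5, 0.5, 0.5]

# Every prediction is off by 0.5, so each term contributes 0.25.
cost = sum_squared_error(y_true, y_pred)
print(cost)  # 1.0
```

A perfect prediction gives a cost of 0, and the cost grows as the predictions move away from the truth.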

The cost J is a function of the weights W, so the goal is to find the W that yields the lowest cost J. The cost function is often defined in a way that is mathematically convenient. Of course, there are multiple ways to measure the distance between the truth and the predicted values, and each cost function has its own applicable cases. You may check out this post to see some other cost functions.

In a simplified two-dimensional space, the relation between the cost J and W may look like:

We keep updating W so that the cost moves toward the lowest point. This algorithm is called gradient descent. By calculating the partial derivatives of J with respect to W, we can keep updating W in the direction that lowers the cost, step by step, until the cost reaches its lowest point.
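The update loop described above can be sketched on a toy problem. Everything here is an assumption chosen for illustration: a single scalar weight w, the cost J(w) = (w - 3)^2, a learning rate of 0.1, and 50 steps. A real network applies the same rule to every entry of W, using the gradients that backpropagation provides.

```python
def cost(w):
    # Hypothetical one-parameter cost with its minimum at w = 3.
    return (w - 3.0) ** 2


def grad(w):
    # Derivative of the cost above: dJ/dw = 2 * (w - 3).
    return 2.0 * (w - 3.0)


def gradient_descent(w, learning_rate=0.1, steps=50):
    """Repeatedly step against the gradient and record the cost after each step."""
    history = [cost(w)]
    for _ in range(steps):
        w = w - learning_rate * grad(w)  # move downhill along the cost curve
        history.append(cost(w))
    return w, history


w_final, history = gradient_descent(w=0.0)
print(w_final, history[0], history[-1])
```

With this learning rate the distance from the minimum shrinks by a constant factor each step, so the recorded cost never increases. A learning rate that is too large would overshoot the minimum instead.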