Backpropagation from Scratch

Artificial Intelligence (AI) is a very hot buzzword these days. Many people are excited (or scared 😱, depending on how you view AI) about its capabilities! Machine learning is the engine that powers any AI system. We can classify supervised machine learning algorithms into two categories. First one is traditional algorithms which are not based on any artificial neural network architecture, such as logistic regression, support vector machine, decision trees, random forest, naive Bayes etc. The other category is based on artificial neural network architecture. Today’s AI capabilities are mostly based on these powerful neural network-based models. In the heart of many powerful models such Large Language Models (LLM) for text and Convolutional Neural Networks (CNN) for images there is a neural network architecture. It is intriguing and very much worth understanding how a neural network works. Neural networks are the main core for most of today’s sophisticated ML algorithms. Neural network tries to learn a complex function connecting inputs and outputs using a layered architecture with nodes. Backpropagation is one of the fundamental and important algorithms it can use to learn the function.

What Is Backpropagation?:

Backpropagation is an algorithm used to train artificial neural networks. It works by propagating the error or the loss from the output layer back to the input layer, adjusting the weights of the connections between neurons as it goes. This weight adjustment is done in a way so that progressively overall loss of the mode gets reduced. This process of minimizing the loss function by adjusting the weights is called the gradient descent, an optimization algorithm.

At the end you will see that backpropagation has some similarities with the perceptron algorithm we discussed earlier. Whereas perceptron was applicable for linearly separable classes, neural networks can work on nonlinear cases using a nonlinear activation function in its hidden layers. Today we will discuss how the backpropagation algorithm works with a very simple example.

Let’s say we have a neural network with one input, one hidden layer with two neurons, and one output neuron. The input is 2, and the output is 1.

Thus, our neural network has the following components.

Input, x = 2 and target output, y = 1. The neural network will start with random weights. Let’s assume the following weights for w1, w2, w3, and w4.

w1 = 0.1, w2 = 0.2, w3 = 0.5, w4 = 0.6

Biases, b1 = 0.3, b2 = 0.4, b3 = 0.7

Assume our learning rate = 1.0 (In real problems it should be a very small number, but for our toy example let’s give learning rate a large value of 1.0)

At first we start with the Forward pass in the hidden layer. We have two neurons in the hidden layer with inputs h1 and h2. To keep calculations simple we can estimate the values of h1 and h2 individually. Later we will explore the options of vectorization to make calculations more efficient. To calculate the values of h1 and h2 we do a linear combination of inputs and weights and then add the bias terms.

Forward Pass:

h1 = w1*x + b1

h1 = 0.1*2 + 0.3 = 0.5

h2 = w2*x + b2 = 0.2*2 + 0.4 = 0.8

Until now we have only done some linear combinations and so far these operations are more similar to a linear regression model. If we continue with these types of linear combinations then regardless of how many layers and neurons we add eventually it will be like a linear regression model without much power that neural networks can achieve. To give neural networks the power to learn any nonlinear functions we add non-linearity to the model. This process of applying non-linearity is also called application of activation function in a neural network. For example, in our use-case we can apply a sigmoid function to our linear combination results to help it learn more complex functions. Although a sigmoid function would not be very practical for deeper neural networks where gradient calculations would make the gradients gradually disappear, we can still use it for our simple case. Also, sigmoid function is one of the earlier functions we used to have as activation functions before researchers found that rectified linear units (ReLU) and its variants are indeed superior choices for deeper neural networks. 

To continue with our simple problem, next we calculate the output of the neurons by applying logistic sigmoid activation functions to our linear combination results of h1 and h2. A logistic sigmoid function squashes values between 0 and 1. In the following we are applying a sigmoid function to h1 and h2 and calling the results a1 and a2 respectively.

a1 = sigmoid(h1) and a2 = sigmoid(h2)

The standard logistic sigmoid function is the following:

\[f(z) = \frac{1}{1 + e^{-z}}\]
Thus, \[a1 = f(h1) = \frac{1}{1 + e^{-h1}}\]

We can also write a1 = sigmoid(h1) … … … eqn(1)

and a2 = sigmoid(h2) … … … eqn(2)

After doing the calculations we get:

a1 = 0.622

a2 = 0.689

Now, we can move to the single-neuron output layer in our neural network. At first we find the input of the neuron o1.

o1 = w3 * a1 + w4 * a2 + b3

or, o1 = 0.5 * 0.622 + 0.6 * 0.689 + 0.7 = 1.4244

Now, let’s find the output of this neuron after applying the non-linear activation function.

a3 = sigmoid(o1) = sigmoid(1.4244) = 0.806 … … … eqn(3)

Here a3 is also the model predicted output.

Thus,  ypred= 0.806

Now, we are done with our prediction of the neural network! Next we calculate the loss of our neural network which captures how much off is our neural network prediction from the true value. For our case the true value is 1 and our predicted value is 0.806. We can calculate the loss using any convenient loss function that is applicable for our use case. In our toy example, to keep things simple, let’s consider a mean squared error (MSE) loss. We calculate the MSE loss in the following way:

\[L = \frac{1}{n} \sum_{i=1}^{n}(y_{pred} – y)^ 2\]

We have only one input value of x, x=2. Thus for our case sample size, n = 1 and the above equation simplifies to:

\[L = (y_{pred} – y)^ 2\]

Putting the values of ypred and y, L = 0.0376

We are done with  the forward pass with our neural network. Now is the most exciting part! The backpropagation part! But, before going to the backpropagation let’s see the intuition of backpropagation. Our model has predicted a value that is not very close to the actual value and thus we have an error. If our model could predict a value that is very close to the actual value of 1 then the error would be very less. To achieve that we need to update the existing weights so that our prediction becomes closer to the actual value. Now, how to connect loss/error with weights? The answer is backpropagation. By using the backpropagation we will update the weights by first finding how much update would be necessary for each weight.

Backpropagation:

As mentioned above, the main idea of backpropagation is to distribute the loss or error from the output layer to the entire network to update the weights in a way so that in the next iteration we can reduce the overall loss. During the first step in backpropagation we calculate the derivative of the loss with respect to the weights of the output layer.

We then use the derivative to update the weights. We do this by multiplying the derivative by a learning rate. The learning rate is a hyperparameter that controls how much the weights are updated. It helps a model to learn faster or slower. Let’s start from the output layer:

Output Layer:

In the output layer we have two weights connecting from the hidden layer to the output neuron, w3 and w4. As the loss is a function of multiple weights we need to find partial derivatives of the loss w.r.t the weights. We’ll start with updating weight for w3. We need to find loss w.r.t w3 before we can update weight for the next iteration. Namely, we need to find ∂L/∂w3 . Now, notice that the loss, L is a function of ypred and y, it’s not a function of w3. Then, how do we find the partial derivative of the loss w.r.t w3? This is the best kept secret of backpropagation!  We use the chain rule of calculus. Although L is not a function of w3, it’s a function of ypred. Again, ypred is a function of o1 as ypred = a3, and o1 is a function of w3. 

I know this part is a little involved. But, rest assured, if you understand the above trick of using chain rule to find partial derivatives of loss w.r.t. weights then you are 90% done understanding backpropagation!

Thus, using the chain rule of calculus we can find the partial derivative of the composite loss function L w.r.t. w3. Following are the details of the calculations.

$$ \frac{\partial L}{\partial w3} = \frac{\partial L}{\partial a3} * \frac{\partial a3}{\partial o1} * \frac{\partial o1}{\partial w3}$$

Now, we can check each of these three partial derivatives of the right side individually to find the final result.

$$ \frac{\partial L}{\partial w3} = \frac{\partial }{\partial a3} * (y_{pred} – y)^ 2 $$ Or $$ \frac{\partial L}{\partial w3} = \frac{\partial }{\partial y_{pred}} * (y_{pred} – y)^ 2 $$
$$ = 2(y_{pred} – y) * 1 = 2(y_{pred} – y) = -0.388 $$

Now, if we do the 2nd derivative:

$$ \frac{\partial a3}{\partial o1} = \frac{\partial }{\partial o1} (\frac{1}{1 + e^{-o1}}) = \frac{\partial }{\partial o1}sigmoid(o1) $$

Now, the partial derivative of a sigmoid function is the sigmoid function multiplied by the term (1 – sigmoid function). If you are familiar with calculus then you can do it easily. If not then in another post I will cover how that works but for now please trust me 🙂

Thus, $$ \frac{\partial }{\partial o1}sigmoid(o1) = sigmoid(o1) * (1- sigmoid(o1)) = a3(1-a3) $$

= 0.806 * (1- 0.806) = 0.156

In the above we have replaced sigmoid(o1) with a3 as we found earlier in eqn (3).

The third derivative:

$$ \frac{\partial o1}{\partial w3} = \frac{\partial }{\partial w3} (w3*a1 + w4*a2 + b3)=a1=0.622 $$

Thus finally combining all the parts:

$$ \frac{\partial L}{\partial w3} =-0.388*0.156*0.622=-0.03765 $$

Now, we are ready to update the weight of w3 for the next iteration. To update the weight we use the result of the partial derivative w.r.t the weight, multiply it by a learning rate and subtract the result from the weight. Thus for w3 following would be the formula:

$$ w3_{new} = w3 – \alpha* \frac{\partial L}{\partial w3}=0.5 – 1.0*(-0.0376) = 0.5376 $$

Now, we can start updating the weight for w4:

$$ \frac{\partial L}{\partial w4} = \frac{\partial L}{\partial a3} * \frac{\partial a3}{\partial o1} * \frac{\partial o1}{\partial w4}$$

If you note, the first two partial derivatives are the same as for w3. Thus, we already have those calculated. All we need to find is the last derivative.

The third derivative:

$$ \frac{\partial o1}{\partial w4} = \frac{\partial }{\partial w4} (w3*a1 + w4*a2 + b3)=a2=0.689 $$

Now, putting all three values:

$$ \frac{\partial L}{\partial w4} =-0.388*0.156*0.689=-0.0417 $$

Now, we can update the weight for w4:

$$ w4_{new} = w4 – \alpha* \frac{\partial L}{\partial w4}=0.6 – 1.0*(-0.0417) = 0.6417 $$

Hidden Layer:

In the hidden layer we have 2 weights, w1 and w2 connecting from the input to the two neurons of the hidden layer, h1 and h2. As we did for the output layer we need to find the partial derivatives of loss w.r.t w1 and w2 (∂L/∂w1,  ∂L/∂w2) to update the weights of w1 and w2 for the next iteration. Let’s start with ∂L/∂w1:

$$ \frac{\partial L}{\partial w1} = \frac{\partial L}{\partial a3} * \frac{\partial a3}{\partial o1} * \frac{\partial o1}{\partial a1 }* \frac{\partial a1}{\partial h1} * \frac{\partial h1}{\partial w1} $$

If you notice here, the first two partial derivatives are already calculated in the output layer section and thus we can calculate the rest three derivatives to find the value of the whole expression in the right side. Let’s start with the third term, ∂o1/∂a1:

$$ \frac{\partial o1}{\partial a1 } = \frac{\partial }{\partial a1 } (w3*a1 + w4*a2 + b3) = w3 = 0.5 $$
Now the 4th term, $$ \frac{\partial a1}{\partial h1 } = \frac{\partial }{\partial h1 }(\frac{1}{1 + e^{-h1}}) = sigmoid(h1)*(1-sigmoid(h1)) $$
$$ =a1(1-a1) =0.622(1-0.622)=0.235 $$

In the above we have replaced sigmoid(h1) with a1 as we found earlier in eqn (1).

And the last term, ∂h1/∂w1:

$$ \frac{\partial h1}{\partial w1}=\frac{\partial}{\partial w1}(w1*x+b1)=x=2 $$

Now, we can put all the term values in the equation to find the value of ∂L/∂w1:

$$ \frac{\partial L}{\partial w1} = -0.388* 0.156*0.5*0.235*2= -0.014 $$

Now, we can update the weight of w1 for the next iteration:

$$ w1_{new}= w1 – \alpha* \frac{\partial L}{\partial w1}= 0.1 – 1.0*(-0.014) = 0.114 $$

We now have only one more weight to update and then we can run the forward pass again to check if there is any improvement for loss. The last weight we need to update is w2 and we can update it in the same way we have updated for w1:

$$ \frac{\partial L}{\partial w2} = \frac{\partial L}{\partial a3} * \frac{\partial a3}{\partial o1} * \frac{\partial o1}{\partial a2}* \frac{\partial a2}{\partial h2} * \frac{\partial h2}{\partial w2} $$

And again as we have seen before first two terms will be the same and we need to find partial derivatives of the last 3 terms:

$$ \frac{\partial o1}{\partial a2}=\frac{\partial}{\partial a2}(w3*a1 + w4*a2 + b3) = w4=0.6 $$
Now the 4th term, $$ \frac{\partial a2}{\partial h2 } = \frac{\partial }{\partial h2 }(\frac{1}{1 + e^{-h2}}) = sigmoid(h2)*(1-sigmoid(h2)) $$

From eqn(2) we can put sigmoid(h2) = a2:

$$ \frac{\partial a2}{\partial h2 }=a2(1-a2) =0.689(1-0.689)=0.214 $$

The last term,

$$ \frac{\partial h2}{\partial w2}=\frac{\partial}{\partial w2}(w2*x+b2)=x=2 $$
Thus, $$ \frac{\partial L}{\partial w2}= -0.388* 0.156*0.6*0.214*2=-0.0155 $$

Now, let’s update the weight of w2:

$$ w2_{new}= w2 – \alpha* \frac{\partial L}{\partial w2}= 0.2 – 1.0*(-0.0155) = 0.2155 $$

We are now done with one pass of backpropagation! Let’s recalculate the loss using these updated weights and see if there is any improvement in loss. To do so we need to find the values of h1, h2, a1, a2, o1, and finally the a3 (remember a3 = ypred).

$$ h1 = w1_{new}* x + b1 = 0.114 * 2 + 0.3 = 0.528 $$
$$ h2 = w2_{new}* x + b2 = 0.2155 * 2 + 0.4 = 0.831 $$
$$ a1 = sigmoid(h1) = sigmoid(0.528)= 0.629 $$
$$ a2 = sigmoid(h2) = sigmoid(0.831)= 0.697 $$
$$ o1 = w3_{new} * a1 + w4_{new} * a2 + b3 $$
or, $$ o1 = 0.5376 * 0.629 + 0.6417 * 0.697 + 0.7 = 1.485 $$
and finally, $$ a3 = sigmoid(o1) = sigmoid(1.485) = 0.815 = y_{pred} $$

Our newly predicted value for y is 0.815 which is closer to actual y of 1 than our previously predicted value of 0.806. Thus, we can see that we are already heading in the right direction to get a predicted y that will be more similar to the actual y. Let’s see how much loss we get with this new value of ypred.

$$ L = (y_{pred}-y)^2= (0.815-1)^2=0.034 $$

Yes! Our loss is decreasing already! Previously the loss was 0.0376 and now we have a loss of 0.034. We could run one iteration of backpropagation and successfully reduce the overall loss in the process. If we run backpropagation for more iterations then loss will decrease even more and we will end up with a predicted value of y which will be very close to the actual value of y. I have run it in python for more iterations and the following are the results: after running for 100 iterations our predicted value becomes 0.9576 and the loss becomes 0.0018. If we run for 1000 iterations then the predicted value becomes 0.9881 and the loss becomes 0.0001. Thus, we can see the model starts to learn very well by adjusting the weights so that it can predict a value that is very very close to the true value.

Backpropagation is one of the most fundamentally important algorithms in machine learning. As it powers neural networks which is the core component of AI, understanding backpropagation is very important for anyone trying to understand how AI works.