# Gradient Descent: Relating It With Real Life Analogies

Accuracy! That is the word that matters most when dealing with Machine Learning or AI problems. Most of the problems people work on are real-world problems, so we can’t overlook the model’s error rate or compromise on it. In this article, I will take you through the concept of Gradient Descent and its analogies with real-life machine learning and AI.

Consider the case of self-driving cars. There is no one inside looking for pedestrians and operating the car accordingly; instead, the model fitted in the car itself detects them and drives appropriately, applying the brakes, slowing down, or taking whatever action is needed. Now, imagine that the fitted model isn’t accurate: it won’t be able to watch out for pedestrians or fellow cars, and could end up crashing, putting several lives at risk.

**Need for Optimizing Algorithms**

*How will we evaluate our model? How can we judge whether it is performing according to our needs or not?* This evaluation is done by calculating a *COST FUNCTION*, which is simply a mapping function that tells us the difference between the desired result and what our model actually computes. Once the cost function is computed, it is *our* duty to correct the model whenever it does something undesirable.
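As a concrete sketch, here is a minimal cost function, mean squared error, which is one common choice; the function name and the toy numbers are made up purely for illustration:

```python
import numpy as np

# Hypothetical sketch: mean squared error as a cost function.
# y_true holds the desired results, y_pred what the model computed.
def mse_cost(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])  # a made-up model output
print(mse_cost(y_true, y_pred))     # the larger this is, the worse the fit
```

The closer the model’s output gets to the desired result, the smaller this number becomes — which is exactly what the optimization algorithms below try to achieve.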

Think of your model as your own child: whenever your child (the model) does something wrong, you correct him, until a time comes when he can distinguish between the do’s and don’ts. In terms of the model, this time comes when the accuracy reaches what is actually required. This correction is done using *Optimization Algorithms.*

Once the initial cost function for the model is computed, it is judged and then optimized with the help of these Optimization Algorithms. This cycle of computing and optimizing the cost function goes on until the desired accuracy level is reached. So, optimization algorithms are helpers which minimize the cost function (or error rate) and thus maximize the accuracy. And among the Optimization Algorithms, *GRADIENT DESCENT* is the name that comes at the very first step of your long journey. What is it?

**Gradient Descent: Introduction**

Gradient Descent is the basis for the more powerful optimization algorithms currently used in Deep (as well as Machine) Learning. So it is necessary to lay a strong foundation, right?

Gradient descent is an optimization algorithm which iteratively finds the values of the learnable parameters of a function (f) that minimize the cost function (or error rate).

It is also known as a first-order iterative optimization technique, as it computes the first-order derivative of the cost function with respect to the learnable parameters (say, the weights).
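To make this concrete, here is a minimal sketch of gradient descent on a one-parameter function, f(w) = (w − 3)², whose first-order derivative 2(w − 3) we can write down by hand; the starting value and learning rate are arbitrary choices for illustration:

```python
# A minimal sketch of gradient descent on f(w) = (w - 3) ** 2,
# whose first-order derivative is f'(w) = 2 * (w - 3).
# The minimum is at w = 3; the start value and alpha are arbitrary.
w = 0.0        # initial (random-ish) value of the learnable parameter
alpha = 0.1    # learning rate

for _ in range(100):
    grad = 2 * (w - 3)      # first-order derivative of the cost wrt w
    w = w - alpha * grad    # the gradient descent update

print(w)  # converges very close to 3.0, the minimum
```

Each iteration moves w a little further downhill; after enough steps it settles at the minimum of the cost.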

**Relating Gradient Descent with real-life analogies:**

- Think of a valley which you want to descend while blindfolded. What a sane human would do is take a step and check the slope of the valley, i.e. whether it goes up or down. Then follow the downward slope, and repeat the same step again and again until you reach the minimum!
- Another widely used analogy: suppose a ball is placed on an inclined plane. According to the laws of physics, it will roll until it finds a flat plane, where it will ultimately stop.

That is exactly what happens in gradient descent: the inclined and/or irregular surface is the cost function when plotted. The job of gradient descent is to provide the direction and the speed of movement so as to reach the minimum of the function, where the cost is least.

**How Does Gradient Descent Work?**

So by now, I hope you know the task from the analogies we just went through: it is to reach the minimum value of the cost function. Let us now see how this is actually accomplished and implemented.

The cost function is first computed for random values of the learnable parameters (weights); it is then evaluated with the help of gradient descent. At this stage, the derivative of the cost function is calculated with respect to each learnable parameter. The sign of the derivative (positive or negative) decides in which direction the next step is to be taken while updating the parameters.

After computing the gradient /derivative, the parameters are updated as follows:

**PARAMETER = PARAMETER – (LEARNING_RATE * GRADIENT)**

- LEARNING_RATE (denoted by α) decides the size of the steps the algorithm takes while descending the cost function. From the figure below, you can infer that choosing an appropriately small α is wise. If you choose too large a value, the situation on the left *(in the figure)* will arise, and the function will not be able to converge.
- The GRADIENT is the derivative of the cost function with respect to the parameter being updated. From the above equation, if the sign of the gradient/derivative is negative, the value of the parameter is increased, so the function moves to the right along the slope (*initially being on the left side of the minimum*). Similarly, if the sign is positive (*i.e. the initial position is on the right of the minimum, as in the figure*), the value of the parameter decreases, moving towards the minimum on the left.
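Both points can be checked numerically on the same kind of one-parameter cost, f(w) = (w − 3)² with derivative 2(w − 3); the `step` helper below is purely illustrative:

```python
# Purely illustrative: one update step on f(w) = (w - 3) ** 2,
# whose derivative is 2 * (w - 3), with the minimum at w = 3.
def step(w, alpha):
    grad = 2 * (w - 3)      # the sign of this decides the direction
    return w - alpha * grad

# Left of the minimum: gradient is negative, so w increases (moves right)
print(step(1.0, 0.1))   # about 1.4, closer to w = 3
# Right of the minimum: gradient is positive, so w decreases (moves left)
print(step(5.0, 0.1))   # about 4.6, closer to w = 3
# Too large a learning rate overshoots instead of converging
print(step(5.0, 1.5))   # -1.0, jumped right past w = 3
```

With a small α every step moves toward the minimum from either side, while too large an α jumps across it, which is the diverging situation described above.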

**Forms of Gradient Descent**

On the basis of the number of instances (training examples) looked at before the parameters are updated, Standard Gradient Descent can be classified into:

- Batch Gradient Descent
- Stochastic Gradient Descent
- Mini-batch Gradient Descent

*Batch Gradient Descent* is the standard gradient descent technique, in which the algorithm calculates the gradient over the whole dataset and performs only one update per pass. Hence, for large datasets, it can be too time- as well as space-consuming.

Stochastic Gradient Descent, on the other hand, updates the parameters for **each training example**. It is hence a much faster technique. However, due to the repeated updates, fluctuations are noticed when the training/learning of the model is plotted over time (see the figure). These fluctuations can make the algorithm keep overshooting instead of settling at the global minimum.

**Mini-batch Gradient Descent** (the most useful of the three) forms batches of n training examples. The algorithm then goes through one batch at a time and performs an update for each batch. The batch size is decided by the programmer with the size of the data in mind; studies have found that a power of 2 (e.g. 32, 64, 128) gives fast and efficient computation.
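To contrast the three update schedules, here is a hypothetical sketch on a toy one-parameter linear model; only the loop structure matters — the model and numbers are made up for illustration:

```python
import numpy as np

# Toy data for a one-parameter linear model y_hat = w * x
# (hypothetical numbers, chosen only to show the loop structure).
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = 2.0 * X                       # the true w is 2

def grad(w, x, y):
    # derivative of mean((w * x - y) ** 2) with respect to w
    return np.mean(2 * (w * x - y) * x)

alpha = 0.01
w = 0.0

# Batch GD: one update per pass, computed over ALL examples
w = w - alpha * grad(w, X, Y)

# Stochastic GD: one update PER training example
for i in range(len(X)):
    w = w - alpha * grad(w, X[i:i+1], Y[i:i+1])

# Mini-batch GD: one update per batch (batch size 2 here)
for k in range(0, len(X), 2):
    w = w - alpha * grad(w, X[k:k+2], Y[k:k+2])
```

The update rule is identical in all three cases; what differs is how many examples are averaged into each gradient before the parameters move.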

**What We Learned**

- Why evaluation of the model is necessary
- How model evaluation is done using Cost Function
- The Need for Optimization Algorithms
- Introduction to Gradient Descent
- The gist of Derivatives, Global Minima, Slope Finding
- How the updating process goes:
  - (A) Random initialization
  - (B) Cost (function) computation
  - (C) Slope/gradient/derivative finding using Gradient Descent
  - (D) Updating of parameters
  - Repeat steps B to D until the global minimum is reached

- Types of Gradient Descent:
- Batch GD or Standard GD
- Stochastic GD
- Mini-batch GD, which is generally preferred over the other two variants

**This example depicts the accuracy and the cost curve for the three variants of Gradient Descent: Batch, Stochastic, and Mini-Batch.**

```python
### Importing datasets and important libraries ###
import numpy as np
import matplotlib.pyplot as plt
import math
import sklearn
import sklearn.datasets
from opt_utils import load_params_and_grads, initialize_parameters, forward_propagation, backward_propagation
from opt_utils import compute_cost, predict, predict_dec, plot_decision_boundary, load_dataset

### Setting up the frame size for plotting in future ###
%matplotlib inline
plt.rcParams['figure.figsize'] = (7.0, 4.0)  # setting default plot size
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
```

```python
def update_params_with_gd(params, grads, learning_rate):
    """
    Update parameters using one step of gradient descent

    Arguments:
    params -- python dictionary containing your parameters to be updated:
                params['W' + str(l)] = Wl
                params['b' + str(l)] = bl
    grads -- python dictionary containing your gradients to update each parameter:
                grads['dW' + str(l)] = dWl
                grads['db' + str(l)] = dbl
    learning_rate -- the learning rate, scalar.

    Returns:
    params -- python dictionary containing your updated parameters
    """
    L = len(params) // 2  # number of layers (each layer has a W and a b entry)

    for l in range(L):
        params['W' + str(l + 1)] = params['W' + str(l + 1)] - learning_rate * grads['dW' + str(l + 1)]
        params['b' + str(l + 1)] = params['b' + str(l + 1)] - learning_rate * grads['db' + str(l + 1)]

    return params
```
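As a quick sanity check of this update rule, here is a hypothetical one-layer parameter dictionary pushed through one gradient descent step (all the numbers are made up):

```python
import numpy as np

# Hypothetical one-layer example of the PARAMETER - LR * GRADIENT update.
params = {'W1': np.array([[1.0, 2.0]]), 'b1': np.array([[0.5]])}
grads = {'dW1': np.array([[0.1, -0.2]]), 'db1': np.array([[0.05]])}
learning_rate = 0.1

for l in range(len(params) // 2):
    params['W' + str(l + 1)] -= learning_rate * grads['dW' + str(l + 1)]
    params['b' + str(l + 1)] -= learning_rate * grads['db' + str(l + 1)]

print(params['W1'])  # the weight with a positive gradient decreased,
                     # the one with a negative gradient increased
```

Note how the sign of each gradient entry decides the direction of that parameter's move, exactly as discussed earlier.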

**Figure showing the different approach of Batch, Stochastic and Mini-batch Gradient Descents**

**Figure 1**: **SGD vs GD**

“+” denotes a minimum of the cost. SGD leads to many oscillations to reach convergence. But each step is a lot faster to compute for SGD than for GD, as it uses only one training example (vs. the whole batch for GD).

**Figure 2**: **SGD vs Mini-Batch GD**

“+” denotes a minimum of the cost. Using mini-batches in your optimization algorithm often leads to faster optimization.

```python
def random_mini_batches(X, Y, mini_batch_size, seed=0):
    """
    Creates a list of random minibatches from (X, Y)

    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    mini_batch_size -- size of the mini-batches, integer

    Returns:
    mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
    """
    np.random.seed(seed)
    m = X.shape[1]
    mini_batches = []

    # Shuffling
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1, m))

    # Partitioning
    num_complete_minibatches = math.floor(m / mini_batch_size)
    for k in range(0, num_complete_minibatches):
        mini_batch_X = shuffled_X[:, k * mini_batch_size: (k + 1) * mini_batch_size]
        mini_batch_Y = shuffled_Y[:, k * mini_batch_size: (k + 1) * mini_batch_size]
        mini_batches.append((mini_batch_X, mini_batch_Y))

    # Last (incomplete) minibatch, if the batch size does not divide m
    if m % mini_batch_size != 0:
        mini_batch_X = shuffled_X[:, num_complete_minibatches * mini_batch_size:]
        mini_batch_Y = shuffled_Y[:, num_complete_minibatches * mini_batch_size:]
        mini_batches.append((mini_batch_X, mini_batch_Y))

    return mini_batches
```

**Loading the Dataset**


```python
train_X, train_Y = load_dataset()
```

```python
def model(X, Y, layers_dims, optimizer, learning_rate=0.0007, num_epochs=10000, print_cost=True):
    """
    3-layer neural network model which can be run in different optimizer modes.

    Arguments:
    X -- input data, of shape (2, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    layers_dims -- python list, containing the size of each layer
    optimizer -- 'bgd', 'sgd' or 'mbgd'
    learning_rate -- the learning rate, scalar.
    num_epochs -- number of epochs
    print_cost -- True to print the cost every 1000 epochs

    Returns:
    params -- python dictionary containing your updated parameters
    """
    costs = []
    seed = 10

    # Initializing params
    params = initialize_parameters(layers_dims)

    if optimizer == 'bgd':
        print("Costs for Batch Gradient Descent")
        mini_batch_size = X.shape[1]   # the whole dataset is one batch
    elif optimizer == 'sgd':
        print("Costs for Stochastic Gradient Descent")
        mini_batch_size = 1            # one example per update
    elif optimizer == 'mbgd':
        print("Costs for Mini-Batch Gradient Descent")
        mini_batch_size = 64

    # Optimization loop
    for i in range(num_epochs):
        seed = seed + 1
        minibatches = random_mini_batches(X, Y, mini_batch_size, seed)

        for minibatch in minibatches:
            # selecting a minibatch
            (minibatch_X, minibatch_Y) = minibatch
            # forward prop
            a3, caches = forward_propagation(minibatch_X, params)
            # compute cost
            cost = compute_cost(a3, minibatch_Y)
            # back prop
            grads = backward_propagation(minibatch_X, minibatch_Y, caches)
            # update params
            params = update_params_with_gd(params, grads, learning_rate)

        if print_cost and i % 1000 == 0:
            print("Cost after epoch %i : %f" % (i, cost))
        if print_cost and i % 100 == 0:
            costs.append(cost)

    plt.plot(costs)
    plt.ylabel("cost")
    plt.xlabel("epochs (per 100)")
    plt.title("Learning rate = " + str(learning_rate))
    plt.show()

    return params
```

**Batch Gradient Descent**

```python
# train 3-layer model
layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, optimizer="bgd")

# Predict
predictions = predict(train_X, train_Y, parameters)

# Plot decision boundary
plt.title("Model with Batch Gradient Descent optimization")
axes = plt.gca()
axes.set_xlim([-1.5, 2.5])
axes.set_ylim([-1, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
```

**Stochastic Gradient Descent**

```python
# train 3-layer model
layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, optimizer="sgd")

# Predict
predictions = predict(train_X, train_Y, parameters)

# Plot decision boundary
plt.title("Model with Stochastic Gradient Descent optimization")
axes = plt.gca()
axes.set_xlim([-1.5, 2.5])
axes.set_ylim([-1, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
```

**Mini-batch Gradient Descent**

```python
# train 3-layer model
layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, optimizer="mbgd")

# Predict
predictions = predict(train_X, train_Y, parameters)

# Plot decision boundary
plt.title("Model with Mini-batch Gradient Descent optimization")
axes = plt.gca()
axes.set_xlim([-1.5, 2.5])
axes.set_ylim([-1, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
```


**Explanation:**

In usual cases, the mini-batch gradient descent optimizer performs much better than batch or stochastic gradient descent. But in this case, as you can see, the number of epochs (10,000) is far greater than usual, so the model overfits when using Stochastic GD. That is why the accuracy of the model with the Stochastic GD optimizer is 94.67%.

What’s happening here is that the system looks at the same training instances over and over across the many epochs. This way the system learns each and every pattern of the instances; we could say it has mugged them all up instead of learning the underlying concept. This is obviously not desired in usual scenarios, so even though the accuracy is 94.67%, it is not said to be a good model.

However, in the case of batch or mini-batch gradient descent, the accuracy is what these optimizers would usually give in almost all cases. From this you can infer that the Mini-Batch GD optimizer, with an accuracy of 79.67%, wins over the Batch GD optimizer, which has an accuracy of 66% here.