# Neural Networks Workflow

Neural networks are the state-of-the-art architectures behind much of the industry today. Whether you talk about speech recognition, caption generation, text summarization, translation, or even image classification, they have a hand in almost everything.

But to operate anything, there is always a *rulebook* which one must follow. And in the case of neural networks, the set of rules is simple: follow the basic workflow of the network! This post walks through the neural network workflow along with the mathematics associated with each stage of the flow of control, so let us get started.

If you are not familiar with what neural networks are and what they are doing here, then I'd suggest you read our previous articles and come back soon. Though I'll try to keep it as basic as I can, as a prerequisite I assume that you all know deep learning [link to the previous blog] and have the gist of neural networks.

**Starting Off: What Are Neural Networks? (A Mathematical Analogy)**

As stated in the last post as well, a neural network is an interconnection of several neurons, or nodes, each of which has weights [W/Ө], biases [b], and activation functions [f(x, y)] associated with it. This interconnection enables the neurons to exchange data in the form of real numbers.

Each neuron is a function which receives some input and returns an output, calculated by an activation function residing inside it. Now say there is a network with 100 neurons; just imagine how many inputs and outputs there will be in this network!
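As a minimal sketch of this idea, a single neuron can be written as a plain function. (The weight, bias, and input values here are purely illustrative, and I've picked a sigmoid as the activation; any activation function would do.)

```python
import math

def neuron(inputs, weights, bias):
    # Weighted sum of inputs plus bias, passed through a sigmoid activation
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

output = neuron([0.5, -1.2], weights=[0.8, 0.3], bias=0.0)
```

A network is then just many such functions wired together, each one's output feeding the next.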

(I have already explained the terms *weight*, *bias*, and *activation function* in the last post here.)

**Neural Nets Workflow**

The pipeline associated with the architecture of a neural network goes as follows –

- Forward Propagation
- Cost Computation
- Optimization Algorithm to minimize this cost function
- Backward Propagation

Let us see in detail how each of these works!

**Forward Propagation**

This is the very first step of the neural architecture pipeline: it takes in the input, assigns some values to the learnable parameters (i.e. weights and biases), and generates some output.

- Starting with the weight and bias units, each is initialized by the programmer; usually the bias is set to zero and the weights to random numbers. This setting up of parameters is done for each neuron in the input layer. [Learn about the input and other types of layers in the previous post.]
- After randomly initializing the learnable parameters, the input is fed into the input layer of the network. The input layer is thus a vector having *n* nodes, where *n* is the number of features distinguished from the dataset.
- From the last post, say there are three features: one categorical column stating the problem statement in one word, another for the ratings of the UI from the users, and a third categorical column stating the background of the engaged crowd in a single word.
- Hence the number of nodes in the input layer would be three, one for each feature.

**Don't confuse the figure above with a neural network doing the stated task. That figure was made just for reference; what we are talking about right now is a neural network with 3 nodes in the input layer and 1 in the output layer, while the number of hidden units depends upon the choice of the programmer.**

- As the input is fed in and the parameters are initialized, the next task is to apply a choice of activation function for each layer. Using this function, an approximate mapping from the input to the output is made. (Approximate because, if you recall, the parameters are randomly initialized in the first step, so an exact mapping can't be achieved from random values.) However, as the learning proceeds, the learnable parameters start changing and moving towards the values which are indeed desired.
- For example, take the function "*Y = Ө.X + b*". After setting up random values for the parameters, the input values are multiplied by the weights of that layer, and then the bias parameter is added to this term. This produces a numerical output which serves as the input of the next layer; so for the next layer, the value of *X* is decided by this function.
- As the activation function is applied to the input layer, it generates some output. Now, this output acts as the input for the first hidden layer, and the process goes on and on for deeper layers until the final *output layer* is reached. This output layer too has an activation function, which generates the final output of the network, used to check the performance of the model.
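The whole forward pass above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the 3-node input layer and single output node follow the example in the text, while the 4-node hidden layer, the sigmoid activation, and the sample input values are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights randomly initialized, biases set to zero (as described above)
W1 = rng.standard_normal((4, 3))   # hidden layer: 4 nodes, 3 input features
b1 = np.zeros(4)
W2 = rng.standard_normal((1, 4))   # output layer: 1 node
b2 = np.zeros(1)

def forward(x):
    # Each layer computes Y = Ө.X + b, then applies the activation;
    # the result becomes the X of the next layer
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return a2

y_hat = forward(np.array([1.0, 4.5, 0.0]))  # one example with three features
```

Notice how `a1`, the activated output of the hidden layer, is exactly the input of the output layer, which is the chaining the bullets describe.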

**Cost Computation**

When a programmer deploys a neural network, he or she has some expectations of the model: what will the output be? Will it appropriately solve the problem statement? If yes, how accurate will it be?

Machine learning problems often come with ground-truth results against which the model's performance is judged (supervised learning), or otherwise, as stated before, there are some expectations associated with it. That is how the above questions can be answered: a comparison is made between the values the model outputs (or, let us say, predicts) and the desired outputs. This is exactly what the **Cost Function** does.

A cost function is used to **evaluate the performance** of the model. It outputs a single real value which denotes the error of the model (and, in the case of supervised learning, takes in the actual results and the model-generated outputs). So the next stage's objective is to minimize this cost function so as to gain a higher accuracy score.

One such cost function is the Root Mean Square Error, *RMSE = √( (1/n) Σ (Pᵢ − Qᵢ)² )*, where P and Q denote the predictions of the model and the actual results, and n is the total number of instances (or training examples) available.
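This RMSE formula translates directly into code. (The sample predictions and actuals below are purely illustrative.)

```python
import math

def rmse(predictions, actuals):
    # Root Mean Square Error: sqrt of the mean squared difference
    # between predicted (P) and actual (Q) values over n instances
    n = len(predictions)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(predictions, actuals)) / n)

error = rmse([2.5, 0.0, 2.0], [3.0, -0.5, 2.0])  # single real value, lower is better
```

The single number it returns is what the next stage of the pipeline tries to drive down.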

**Optimizing the Cost Function**

I'd like to use an analogy to explain this "optimizing the cost function" stage, one which I really connect with and hope that you will too.

A machine learning model is often likened to a kid who is left in the world to learn from his mistakes and is sometimes corrected by his guardian when he makes one. When a model is first deployed for training, it is like a kid made to roam around a park on his own, whose objective is to reach an ice cream truck standing in the corner. Whenever he deviates from the path which leads to the truck (i.e. the accurate output, in the model's case), a guardian corrects him and points him back, again and again, until he finally reaches the goal.

Now in the neural network's case, the optimization algorithm acts as a guardian for the model, as it does exactly the same thing of correcting it whenever it goes wrong. The aim of an optimization algorithm is to minimize the cost function and thus increase the accuracy of the model significantly. The most basic and simplest algorithm used for neural nets is **Gradient Descent**.

However, there are various other techniques which could be equipped for your baby model, such as Adagrad, RMSprop, Adam, and Adadelta, as well as various variants of gradient descent itself [check out the post here].
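As a sketch of the basic idea, vanilla gradient descent repeatedly steps a parameter against the gradient of the cost. The quadratic cost, learning rate, and step count below are purely illustrative choices of mine.

```python
def gradient_descent(grad, theta, lr=0.1, steps=100):
    # Move theta opposite to the gradient of the cost, step by step
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Illustrative cost J(θ) = (θ - 3)², with gradient 2(θ - 3); its minimum is at θ = 3
theta = gradient_descent(lambda t: 2 * (t - 3), theta=0.0)
```

Starting from θ = 0, the updates converge towards θ = 3, the point where the cost is smallest; in a neural network the same update is applied to every weight and bias.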

**Backward Propagation**

After the forward pass, computation of the cost function, etc. are done, then comes the part of updating the learnable parameters (remember we initialized them randomly in the forward pass?). This update is done using a technique known as *Backward Propagation* (or simply backprop).

In the optimization pipeline, the **derivative of the cost function is computed with respect to each parameter involved in the model** (i.e. every weight parameter, every bias parameter, and every activation associated with them). This derivative tells us how much change must be made to a particular variable so as to increase the accuracy of the model, or in other words, decrease the cost function. Hence the *parameter which contributes more to the cost function is penalized more, thanks to this derivative approach*.

As the name suggests, the backward pass operates in the direction opposite to forward propagation: in the very first step, the derivative of the cost function is computed w.r.t. the last layer's activation, then its weights and bias. Then control shifts to the previous layer, does the same there, and moves on similarly until it finally reaches the very first layer. This way each parameter is updated, and then the cycle reruns from the forward pass again, but this time with *updated* parameters!
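The full cycle can be sketched on the smallest possible network: a single sigmoid neuron trained on one example with a squared-error cost. All the numbers here (input, target, initial weight, learning rate, step count) are illustrative; a real network repeats the same chain-rule computation layer by layer, from the output back to the input.

```python
import math

x, y = 1.5, 1.0           # one training example: input and desired output
w, b, lr = 0.2, 0.0, 0.5  # random-ish weight, zero bias, learning rate

for _ in range(200):
    # Forward pass: z = w·x + b, a = sigmoid(z), cost C = (a − y)²
    z = w * x + b
    a = 1.0 / (1.0 + math.exp(-z))
    # Backward pass (chain rule): dC/dw = dC/da · da/dz · dz/dw
    dC_da = 2 * (a - y)       # derivative of cost w.r.t. the activation
    da_dz = a * (1 - a)       # derivative of the sigmoid
    # Update each parameter against its own gradient, then rerun the forward pass
    w -= lr * dC_da * da_dz * x   # dz/dw = x
    b -= lr * dC_da * da_dz * 1   # dz/db = 1
```

Each iteration is one full forward pass, backward pass, and parameter update; over the iterations, the neuron's output drifts towards the target y.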

**Topics Covered**

- Forward Propagation
- Cost Computation
- Optimization Algorithm to minimize this cost function
- Backward Propagation

Deep learning is obviously an iterative process, as Andrew Ng has said, but you cannot afford to struggle with just the basic workflow. Now there you go, try it hands-on: make a neural network from scratch! It may be a hectic task, but it will really help you understand the idea even better. Meanwhile, stay connected, for I'll be covering how to make and deploy your own neural network in just 20 lines of code 😉