# Detailed Introduction to Recurrent Neural Networks

All those who have been intrigued by Artificial Intelligence and related fields are well aware of neural networks and their applications. But one of the main disadvantages neural networks suffer from is that they cannot store information, which is the primary requirement when we try to do text prediction, language translation, image captioning, etc. To overcome such issues, we have Recurrent Neural Networks, which help in tasks where the prediction is based not only upon the current input but also on past outputs.

For more details on Neural Networks, read our previous article on **Neural Networks**.

Now it’s time to start our journey, where we will look in depth at Recurrent Neural Networks.

**In this article we’ll cover:**

- What is a Recurrent Neural Network
- Working of the Recurrent Neural Network
- Backpropagation in RNNs
- Disadvantages of RNNs – Vanishing and Exploding Gradient Problems
- Introduction to the LSTM Network
- Variants of Recurrent Neural Networks
- Implementation of an RNN
- Applications of Recurrent Neural Networks

**What is a Recurrent Neural Network?**

We as human beings have the ability to remember things, and using that ability we can make decisions in our everyday lives. Recurrent Neural Networks have this same ability.

Recurrent Neural Networks are one of the most powerful and robust types of neural networks. The fact that distinguishes RNNs is their internal memory.

*This figure shows an example of a Recurrent Neural Network where the output is fed back to the input*

You might think that Recurrent Neural Networks are a new invention, but this is not true: they are quite old, having been first introduced in the 1980s. Their true potential, however, has gained noteworthy attention only in the recent past, because of the rise in computing power and advances in network design.

As mentioned earlier, RNNs have an internal memory, which makes them capable of remembering things related to the inputs they receive. Through this, they can predict what will come next as output and gain a profound understanding of a sequence and its context.

Due to all the above reasons, RNNs easily manage sequential data like time series, speech, text, audio/video, and much more.

**Working Of Recurrent Neural Networks**

Before we understand the working of an RNN, let’s discuss what sequential data is all about. Sequential data is ordered data where related items are joined together to form a series or sequence. Examples of sequential data are DNA sequences, audio/video data, and time-series data (a series of data points indexed in time order), among many others.

*Figure 2: Example of Time Series Data which is a type of sequential data and input to RNN*

In an RNN, information loops continuously. The network combines the present input with the previous output to produce the next output.

From the above diagram, it is evident that in an RNN the generated output is fed back along with the next input. This unique ability of RNNs to remember exactly what they have previously seen helps in the prediction of various kinds of sequential data.

In contrast, Feed-Forward Neural Networks do not have any kind of memory of the inputs obtained earlier, because of which they are poor at dealing with data that has a temporal order. The only thing they remember is the training performed on them.

In an RNN, we assign weights to the two inputs, i.e. the current input and the recent past. Along with this, the change in weights required for reducing the error during training is done through gradient descent and Backpropagation Through Time.
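To make this recurrence concrete, here is a minimal sketch of a single RNN step in plain numpy. The weight names (W_xh, W_hh, b_h) and the dimensions are illustrative assumptions for the example, not taken from any library:

```python
import numpy as np

# Illustrative dimensions: 3 input features, 4 hidden units
rng = np.random.RandomState(0)
W_xh = rng.randn(4, 3) * 0.1   # weights applied to the current input
W_hh = rng.randn(4, 4) * 0.1   # recurrent weights applied to the recent past
b_h = np.zeros(4)              # hidden bias

def rnn_step(x_t, h_prev):
    # Combine the current input with the previous hidden state
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Run a short sequence of 5 timesteps through the recurrence
h = np.zeros(4)
for x_t in rng.randn(5, 3):
    h = rnn_step(x_t, h)
print(h.shape)  # (4,)
```

The same hidden state `h` is carried from step to step, which is exactly the "internal memory" described above.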

**Backpropagation through time in RNN**

Having learned the way recurrent neural networks work, it’s time to look at how RNNs are trained on the data provided to them. While training an RNN we have to answer questions such as: how do we decide the weights for each connection? How do we initialize the weights for the hidden units?

For answering these and many more such questions we use backpropagation of error and gradient descent. But here comes the catch: we cannot use the backpropagation of error used for Feed-Forward Neural Networks as-is.

*Figure 3: Unrolled Recurrent Neural Network consisting of sequences of neural networks*

The reason why we cannot use traditional backpropagation is that RNNs are cyclic graphs, whereas feed-forward networks are directed acyclic graphs. Because of this structure, in a feed-forward network we are able to calculate the error from the layer above, but as RNNs have a different structure we cannot calculate the error with the same method.

To perform backpropagation in an RNN, we must unroll the network so that we can view it as a sequence of neural networks. In the above diagram, we can see that the right side consists of a series of networks where the error at the present timestep depends on the previous timestep.

Therefore, Backpropagation Through Time (BPTT) lets the error be backpropagated from the last timestep to the first while we unroll the RNN. Using this method, we are able to calculate the error for each timestep, and through this we update the weights and try to reduce the error with each epoch.

The disadvantage of BPTT is that it can get computationally expensive when the number of timesteps is high.

**Disadvantages of RNN’s – Vanishing and Exploding Gradient Problem**

The two main disadvantages of RNNs are as follows:

**1. Exploding Gradient**

We are aware of the term gradient from traditional neural networks. In simple terms, the gradient tells us how much the output of a function changes when we change its inputs.

In RNNs, the repeated multiplications during backpropagation through time can make the gradients grow uncontrollably large, so the algorithm assigns the weights exaggerated importance without any justification. To counter this problem, we can truncate/squash the gradients.
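As a sketch of this truncating/squashing idea, here is gradient clipping by norm in plain numpy; the threshold of 5.0 and the example gradient are arbitrary illustrative choices, not values from any framework:

```python
import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    # Rescale the gradient if its norm exceeds the threshold
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

exploding = np.array([300.0, 400.0])    # norm 500, far too large
clipped = clip_by_norm(exploding)
print(np.linalg.norm(clipped))          # ≈ 5.0
```

The direction of the gradient is preserved; only its magnitude is squashed down to the threshold.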

**2. Vanishing Gradient**

In the vanishing gradient problem, the values of the gradient become too small, and the model’s learning is halted or the time taken for learning becomes too long.

To solve this issue, the idea of the LSTM (Long Short-Term Memory) network is used, which was proposed by Sepp Hochreiter and Juergen Schmidhuber.

**Introduction to LSTM Network**

LSTM, or the Long Short-Term Memory network, is an extension of the Recurrent Neural Network. We are well aware that RNNs are good to use when we have sequential data, but we have also looked at the disadvantages they face. To overcome these issues, we have the LSTM network.

LSTM resolves both issues and is capable of training over long sequences, which was not feasible in normal recurrent neural networks. LSTM units are used as building blocks for an RNN, resulting in an LSTM network. LSTM networks are analogous to computers, as they can perform operations like reading from, writing to, and deleting from their stored memory.
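To illustrate the read/write/delete analogy, here is a minimal sketch of one LSTM cell step in plain numpy. The gate names and random weights are illustrative assumptions, not the weights a real Keras LSTM layer would learn:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 3 input features, 4 hidden units
rng = np.random.RandomState(1)
n_in, n_h = 3, 4
Wf, Wi, Wo, Wc = (rng.randn(n_h, n_in + n_h) * 0.1 for _ in range(4))
bf = bi = bo = bc = np.zeros(n_h)

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(Wf @ z + bf)          # forget gate: "delete" from memory
    i = sigmoid(Wi @ z + bi)          # input gate: "write" to memory
    o = sigmoid(Wo @ z + bo)          # output gate: "read" from memory
    c_tilde = np.tanh(Wc @ z + bc)    # candidate memory content
    c = f * c_prev + i * c_tilde      # updated cell state (long-term memory)
    h = o * np.tanh(c)                # new hidden state (short-term output)
    return h, c

h, c = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.randn(5, n_in):
    h, c = lstm_step(x_t, h, c)
print(h.shape, c.shape)  # (4,) (4,)
```

The separate cell state `c` is what lets the LSTM carry information across long sequences without the gradient vanishing as quickly as in a plain RNN.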

**Implementation of Recurrent Neural Networks**

Now it’s time to dig deep and try to implement a recurrent neural network using Keras.

**Keras** is a high-level API written in Python for implementing neural networks, with the ability to run on top of TensorFlow or Theano.

Before installing Keras, we have to install TensorFlow. This is because Keras acts as a wrapper around TensorFlow/Theano for building more complicated Deep Learning models while providing an easy interface to interact with.

**For installing TensorFlow you can use the following command:**

```shell
pip install tensorflow
```

**Now install Keras with the following command:**

```shell
pip install keras
```

It’s time to commence the implementation of our very own recurrent neural network model.

**Step 1:**

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.utils import np_utils
```

Here we begin by importing the required libraries. We have imported the numpy library for performing mathematical operations and for structuring the input, output, and data labels.

After this, we have imported some specific functions of Keras used for building our RNN. The specific role of each function will be discussed in later steps.

**Step 2:**
For the actual implementation, we will require input data. Here the input data is a monologue from Othello. You can get the text file from here. Remember to save the text file in the same directory where the Python/Jupyter notebook is kept.

```python
# Reading the data and converting it into lower case
data = open("Othello.txt").read().lower()
# Creating a sorted list of the unique characters in the data
chars = sorted(list(set(data)))
# Counting the total number of characters
totalChars = len(data)
# Number of unique chars
numberOfUniqueChars = len(chars)
```

The input data available is in the form of text, and we want to convert it into a form compatible with Keras. For this reason, we convert the text into lowercase, which is a form of normalization.

After this, we create a sorted list of the unique characters in the text and store the total number of characters in the dataset in ‘totalChars’. Lastly, we store the number of unique characters in ‘numberOfUniqueChars’.

**Step 3:**
To represent each character as a number, we will be using dictionaries.

```python
# For better results we assign an Id to each character
CharsForids = {char: Id for Id, char in enumerate(chars)}
# This is the opposite of the above
idsForChars = {Id: char for Id, char in enumerate(chars)}
# Here we decide the number of characters learned, i.e. the timestep
numberOfCharsToLearn = 100
```

First, we create a dictionary where the character is the key and each key character is represented by a number. In the next line of code, we do the opposite of the previous line. Lastly, we decide the number of characters to be trained in one timestep, i.e. one training example.

**Step 4:**
The following counter determines how many training sequences of 100 characters we can extract from the data.

```python
# Since each timestep sequence covers 100 chars, we stop the loop 100 chars
# before the end of the data, or there will be an index out of range
counter = totalChars - numberOfCharsToLearn
```

Next, we create empty lists for storing the formatted data: the input ‘charX’ and the output ‘y’.

```python
# Here we store the input data
charX = []
# Here we store the output data
y = []
# This loops through all the characters, stopping 100 before the end
for i in range(0, counter, 1):
    # The slice takes 100 values starting from i, stopping just before i+100
    theInputChars = data[i:i + numberOfCharsToLearn]
    # Without ':' we index a single position; essentially the output char is
    # the next char in line after those 100 chars in X
    theOutputChars = data[i + numberOfCharsToLearn]
    # Append the 100 chars as Ids to the list charX
    charX.append([CharsForids[char] for char in theInputChars])
    # For every 100 input values there is one y value, which is the output
    y.append(CharsForids[theOutputChars])
```

‘theInputChars’ stores the first 100 input characters; the loop then repeats with the window shifted forward, taking the next 100 input characters, and continues in this way for the rest of the input.

‘theOutputChars’ stores only one character: the character immediately after the final character in ‘theInputChars’.

Lastly, in the ‘charX’ list we append the 100 integers used in that iteration; these integers are the IDs of the input characters. Along with this, the integer ID of the single character in ‘theOutputChars’ is appended to the ‘y’ list as the output.

**Step 5:**
After the above steps, we want the data to be in a correct form for Keras.

First, we reshape the input array, where the three parameters represent ‘samples’, ‘time-steps’, and ‘features’; this form is required by Keras.

```python
# len(charX) represents how many of those timestep sequences we have
# numberOfCharsToLearn is how many characters we process per sequence
# Features are set to 1 because each timestep holds a single character Id
X = np.reshape(charX, (len(charX), numberOfCharsToLearn, 1))
```

For efficient and effective results, we normalize the data.

```python
# For normalizing
X = X / float(numberOfUniqueChars)
# This sets it up so we can have a categorical (#features) output format
y = np_utils.to_categorical(y)
print(y)
```

**Output:**
This is the categorical form of the output.

Here we transform ‘y’ into a one-hot vector. A **one-hot vector** is an array of 0’s and 1’s in which a 1 occurs only at the position corresponding to the character’s ID. Each output example becomes a vector of length ‘numberOfUniqueChars’.
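As a quick illustration of what this one-hot encoding looks like, here is a plain-numpy equivalent on a hypothetical three-character alphabet (the IDs and alphabet are made up for the example):

```python
import numpy as np

# Suppose the character IDs for a tiny alphabet {a, b, c} are 0, 1, 2
ids = [0, 2, 1]
num_classes = 3

# Build the one-hot matrix: one row per example, a single 1 at the ID's index
one_hot = np.zeros((len(ids), num_classes))
one_hot[np.arange(len(ids)), ids] = 1
print(one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```

Each row sums to 1, which is exactly the format the softmax output layer will be trained against.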

**Step 6:**
Finally, it’s time to build our Recurrent Neural Network model.

```python
model = Sequential()
# Since we know the shape of our data we can pass the timestep and feature
# dimensions; the number of samples is dealt with in the fit function
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
# Number of features on the output
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(X, y, epochs=100, batch_size=128)
model.save_weights("Othello.hdf5")
# model.load_weights("Othello.hdf5")
```

In the above code snippet, **Line 1** uses Sequential(), which is imported from Keras. It creates an empty template model used for building the RNN.

Next, we add the first layer to the empty template model. This layer is an LSTM layer containing 256 units, with ‘input_shape’ as one of its parameters.

The ‘Dropout’ layer ensures that overfitting, which occurs frequently in RNNs, is kept to a minimum. To restrict overfitting, it randomly selects neurons and ignores them during training. Here ‘Dropout’ is given the parameter ‘0.2’, which means 20% of the neurons will be dropped.

‘Dense’ is used for producing the output layer of the neural network/recurrent neural network.

Using the ‘add’ function, this Dense layer acts as the output layer. Here the softmax activation is applied to the dot product of the weights and inputs plus a bias.

Now, in the configuration settings, we set the loss function to ‘categorical_crossentropy’ and the optimizer to ‘adam’.

With the fit() function, we run the training algorithm. The epochs parameter specifies the number of times the whole dataset is passed through the network. In this tutorial the number of epochs is set to 100; if you want, you can change it and see what different results you get. The batch size specifies the number of examples evaluated per weight update. Here it is set to 128, i.e. the first 128 examples go in as input, then the next 128, and so on through the whole dataset.
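As a quick sanity check on this batching arithmetic (with an illustrative dataset size of 1000, not the actual number of Othello sequences):

```python
import math

num_examples = 1000   # hypothetical number of training sequences
batch_size = 128
epochs = 100

# Each epoch processes the dataset in ceil(num_examples / batch_size) batches
batches_per_epoch = math.ceil(num_examples / batch_size)
# Each batch triggers one weight update
total_updates = batches_per_epoch * epochs
print(batches_per_epoch, total_updates)  # 8 800
```

This is why a larger batch size trains faster per epoch (fewer updates) but each update averages over more examples.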

At last, the training is completed and we can save the weights. Moreover, we can also load the previously trained weights.

**Output:**
The following is the output expected when we start the training process; it will take some time. You can see that each of the 100 epochs is executed and the error, i.e. the loss value, continuously decreases, which is a good sign.

Once this computation is finished, the code jumps to the code for predicting the text. So let’s have a look at it.

**Step 7:**
Initially, in **Line 1** we generate a random value from 0 to one less than the length of the input data. In **Line 2** we get the starting sequence in integer form.

With **Line 3** we start a loop that runs 500 times; this value can be changed to see different results. The value 500 means we generate 500 characters through this loop.

Using **Line 4** we reshape the data example used for predicting the next character. After normalizing in **Line 5**, we supply it to the prediction model in **Line 6**.

```python
randomVal = np.random.randint(0, len(charX) - 1)
randomStart = charX[randomVal]
for i in range(500):
    x = np.reshape(randomStart, (1, len(randomStart), 1))
    x = x / float(numberOfUniqueChars)
    pred = model.predict(x)
    index = np.argmax(pred)
    randomStart.append(index)
    randomStart = randomStart[1:len(randomStart)]
print("".join([idsForChars[value] for value in randomStart]))

In **Line 7** we get the index of the next predicted character. In **Lines 8** and **9** we append the predicted character to the starting sequence, which gives 101 characters, and then omit the first character to keep the most recent 100 characters.

Lastly, after looping 500 times, we print out the generated text by converting the IDs back to their designated characters.

**Output:**

This is the output text predicted by the model; it will vary according to the changes you make in the model. The generated prediction is not very convincing, but as our model is very basic, this result is to be expected.

**Applications of Recurrent Neural Networks**

**Natural Language Processing**
Numerous models have been built that represent a language model. Such models are capable of generating poems based on large corpora of poems provided as input.

**Language Modelling and Generating Text**
Prediction of words and text is done using input sequences of text/sentences.

**Language Translation**
Many famous applications like Google Translate and Duolingo use this technique to translate one language into another.

**Image/Video Tagging**
Here we identify the different objects present in each frame of an image/video. For its implementation, an RNN is combined with a CNN to get the desired output, but this application is still in its early days.

I hope you have enjoyed and learned a lot from this article.