Beginner’s Guide to Reinforcement Learning

Reinforcement Learning (RL) is a machine learning technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences. We are all familiar with the most faithful friend we humans have: the dog. We can learn a lot from dogs while playing with them, but what if I told you that dogs are a perfect example of Reinforcement Learning? If you don’t believe me, let’s find out what Reinforcement Learning actually is and how dogs showcase it.

In this introduction to Reinforcement Learning, we’ll walk through:

  1. What is Reinforcement learning in simple words?
  2. The components of Reinforcement Learning problem.
  3. Distinguishing between Reinforcement learning, Supervised and Unsupervised learning.
  4. Algorithms used for implementing RL.
  5. Practical implementation of Reinforcement learning.
  6. Ways used for learning.
  7. The disadvantage of Reinforcement Learning.
  8. Applications of Reinforcement Learning around us.
  9. Real world implementation of Reinforcement Learning.

Reinforcement Learning in Simple Words

Reinforcement Learning is about learning the best actions on the basis of rewards and punishments. When we put on our technical goggles, though, Reinforcement Learning is defined using three basic concepts: states, actions, and rewards.

Here the “state” describes the situation the agent is in. The agent performs “actions”, and based on these actions it receives either rewards or punishments.

Figure 1: Dog playing with the owner

Consider the dog example: we have the owner and the dog itself (the “agent”). When the owner is in the garden with the dog, he or she throws a ball. The thrown ball is the “state” for the agent, and the dog running after the ball is the “action”.

The dog then receives appreciation or food from the owner, which is the “reward” for the action. If the dog does not go after the ball and takes some alternative action instead, it may get some “punishment”. This is what Reinforcement Learning is all about. Next, we’ll look at the terminology that Reinforcement Learning comprises.

Components of Reinforcement Learning Problem

Every Reinforcement Learning problem has some predefined components which help in representing and understanding the problem. The components are as follows:

Fig 2: Reinforcement Learning Framework

Agent: The agent takes actions. As mentioned earlier, in our example the dog is the agent.

Action (A): The agent has a set of actions A from which it selects the action to perform, just like the dog deciding whether to go after the ball, just look at the ball, or jump in place.

Discount Factor: The discount factor, usually a value between 0 and 1, is multiplied with the future rewards discovered by the agent to dampen their effect on the agent’s choice of action. Put simply, the discount factor makes future rewards less valuable than immediate rewards, which pushes the agent toward short-term goals. The lower the value of the discount factor, the less significant future rewards become, and vice versa.

Environment: The environment is the surroundings in which the agent moves. In the dog example, the environment consists of the owner and the garden in which the dog is present. The environment takes the agent’s current state and action as inputs and returns the agent’s reward as output.

State: A state is the immediate situation in which the agent finds itself in relation to other important things in its surroundings, such as tools, obstacles, enemies and prizes/rewards. Here the dog is required to

Reward (R): The reward is the output the agent receives in response to its actions. For example, the dog (agent) receives dog food as a reward if it brings back the ball; otherwise, it receives a scolding as a punishment.

Policy: The policy is the strategy the agent uses to determine which action to take on the basis of the current state. In other words, the agent maps states to actions, choosing for each state the action that yields the maximum reward. In the dog example, once the dog learns that dog food is given as a reward for bringing back the ball, it will form its own policy to reap maximum rewards.
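To make the discount factor from the list above concrete, here is a minimal sketch in Python. The function name discounted_return and the reward values are made up for illustration; the only assumption is the usual convention that the reward at step t is weighted by the discount factor raised to the power t.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward at step t by gamma ** t."""
    total = 0.0
    # Walk backwards so each step needs only one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]  # hypothetical rewards from three consecutive steps
print(discounted_return(rewards, gamma=1.0))  # 3.0  (future counts fully)
print(discounted_return(rewards, gamma=0.5))  # 1.75 (future counts less)
print(discounted_return(rewards, gamma=0.0))  # 1.0  (only the immediate reward)
```

Notice how the same three rewards are worth less and less as the discount factor shrinks, which is exactly the short-term focus described above.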

Markov Decision Processes (MDPs) are mathematical frameworks for describing an environment in reinforcement learning, and almost all RL problems can be formalized using MDPs.

Basically, an MDP consists of a finite set of environment states S, a set of possible actions A(s) in each state, a real-valued reward function R(s), and a transition model that gives the probability of reaching each next state from a given state and action.
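As a rough illustration, an MDP this small can be written down with plain Python dictionaries. The state names, actions, rewards and probabilities below are all hypothetical, loosely based on the dog example, and are not taken from any library.

```python
import random

# Hypothetical two-state MDP loosely based on the dog example.
STATES = ["ball_thrown", "ball_returned"]                           # S
ACTIONS = {"ball_thrown": ["fetch", "sit"], "ball_returned": []}    # A(s)
REWARD = {"ball_thrown": 0.0, "ball_returned": 1.0}                 # R(s)

# Transition model: TRANSITIONS[s][a] is a list of (probability, next_state).
TRANSITIONS = {
    "ball_thrown": {
        "fetch": [(0.9, "ball_returned"), (0.1, "ball_thrown")],
        "sit": [(1.0, "ball_thrown")],
    },
}


def sample_next_state(state, action, rng=random):
    """Draw the next state according to the transition probabilities."""
    outcomes = TRANSITIONS[state][action]
    probabilities = [p for p, _ in outcomes]
    next_states = [s for _, s in outcomes]
    return rng.choices(next_states, weights=probabilities, k=1)[0]


print(sample_next_state("ball_thrown", "sit"))  # always "ball_thrown"
```

Here "ball_returned" has no actions, so it plays the role of a terminal state.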

Differences Between Reinforcement Learning, Supervised and Unsupervised Learning

Anyone with some basic knowledge of Artificial Intelligence will be familiar with the terms supervised and unsupervised learning. Reinforcement learning has likewise become a buzzword in the field of AI, and its implementations have gained huge popularity. For example, the famous AlphaGo, developed by Google DeepMind using Reinforcement learning, went on to defeat the world champion Lee Sedol at the game of Go.

Now you must be wondering why supervised or unsupervised learning was not used. So let’s look at the areas where Reinforcement learning does better than the other two methods.

Supervised vs. Reinforcement Learning

Supervised learning gets its name from the use of an external supervisor who knows the environment and shares that knowledge with the agent to accomplish the task. In general, supervised learning is like learning from tasks which have already been completed: as an agent, you have to gain your experience from them. But in some cases there are no completed tasks from which experience can be gained, and so there can be no supervisor.

Since the number of possible moves in the game of Go runs into the billions, we cannot create a knowledge repository. The only option left is to play more and more games to gain experience and extract knowledge from it.

In both supervised learning and reinforcement learning we are mapping between input and output. In reinforcement learning, however, the reward function acts as the feedback or experience, whereas in supervised learning the feedback comes from the supervisor.

Source Internet

Unsupervised vs. Reinforcement Learning

In unsupervised learning, unlike reinforcement learning, there is no concept of mapping between input and output. The main aim is to find hidden patterns. For example, many recommendation systems, such as those for movies or news articles, use unsupervised learning. Here we build a knowledge graph on the basis of the constant feedback the customer provides by liking particular movies or articles, and then similar items are recommended.


Algorithms used for Implementing RL

Reinforcement learning and its fundamental concepts need to be implemented in practice, and for that we use the following algorithms:


Q-learning

Q-learning is one of the most widely used reinforcement learning algorithms. With it, the agent learns the quality (Q-value) of each action in each state, based on how much reward the environment returns. The policy then follows from picking the action with the highest Q-value.

Q-learning uses a table (the Q-table) to store a Q-value for each combination of environment state and action.
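A single Q-table update can be sketched as follows. This is a minimal illustration, not a full training loop: the function name q_learning_update, the learning rate alpha, the discount gamma and the state and action names are all assumptions chosen for the example. The table is a dictionary keyed by (state, action) pairs.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state,
                      next_actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update to q_table in place."""
    # Greedy estimate: the best Q-value available from the next state
    # (0.0 if the next state is terminal and offers no actions).
    best_next = max((q_table[(next_state, a)] for a in next_actions),
                    default=0.0)
    target = reward + gamma * best_next
    # Move the stored value a fraction alpha of the way toward the target.
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]


q_table = defaultdict(float)  # unseen (state, action) pairs start at 0.0
q_learning_update(q_table, "ball_thrown", "fetch", reward=1.0,
                  next_state="ball_returned", next_actions=[])
print(q_table[("ball_thrown", "fetch")])  # 0.1
```

Repeating this update over many steps is what gradually fills the table with useful values.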

SARSA (State-Action-Reward-State-Action)

SARSA resembles Q-learning to a large extent. The only difference between the two is that SARSA learns the Q-value based on the action actually performed by the current policy, whereas Q-learning uses the greedy action, i.e. the one with the highest Q-value in the next state.
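The difference shows up in a single line of the update. Below is a minimal sketch of a SARSA step; the function name sarsa_update, the values of alpha and gamma, and the states and actions are hypothetical. Where Q-learning would take the maximum over the next state's actions, SARSA uses the Q-value of the action the policy actually chose next.

```python
from collections import defaultdict


def sarsa_update(q_table, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.9):
    """One SARSA update: bootstrap from the action actually taken next."""
    target = reward + gamma * q_table[(next_state, next_action)]
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]


q = defaultdict(float)
q[("s1", "left")] = 1.0
q[("s1", "right")] = 5.0
# The policy happened to pick "left" in s1, so SARSA bootstraps from 1.0,
# not from the greedy maximum of 5.0 that Q-learning would use.
print(sarsa_update(q, "s0", "go", reward=0.0,
                   next_state="s1", next_action="left"))
```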

Practical implementation of Reinforcement learning

Now we will look at a basic implementation of Reinforcement Learning using the OpenAI Gym library. Gym is a toolkit for developing and comparing Reinforcement Learning algorithms.

Gym provides us with a variety of test problems, i.e. environments, all of which can be used to learn more about our reinforcement learning algorithms.

Installation of Gym

Before starting to work with Gym, we need to install gym using pip:

pip install gym

Or directly from an IPython notebook:

!pip install gym

After this, we are ready to start.

Step 1:

First, we import the gym library which we installed earlier. Then we use one of Gym’s built-in environments, CartPole. This environment displays a pole trying to balance on a cart which moves left and right.

Step 2:

Here we create a function which decides the action the agent should take on the basis of the state of the environment. With this function, our main aim is to keep the pole on the cart balanced so that it does not fall down. So if the pole leans more than a given angle, we return 0 as the result.
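A function of this kind might look like the sketch below. The original listing is not reproduced here, so treat the details as assumptions: the name basic_policy comes from Step 5, the pole angle is read from index 2 of the observation (the standard CartPole layout), and the threshold angle is assumed to be zero.

```python
def basic_policy(obs):
    """Push the cart toward the side the pole is leaning to."""
    # CartPole observation: [cart position, cart velocity,
    #                        pole angle, pole angular velocity]
    angle = obs[2]
    # Action 0 pushes the cart left, action 1 pushes it right.
    return 0 if angle < 0 else 1


print(basic_policy([0.0, 0.0, -0.05, 0.0]))  # 0 (pole leans left)
print(basic_policy([0.0, 0.0, 0.05, 0.0]))   # 1 (pole leans right)
```

This is a hand-written rule rather than a learned policy, which makes it a handy baseline before trying a real RL algorithm.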

Step 3:

In the totals list, we store the rewards collected in each episode.

Step 4:

In this loop, we calculate the rewards obtained in each episode. Each iteration starts with an observation value which has been reset using the reset() function.

Step 5:

Here we run an instance of the CartPole environment for up to 1000 timesteps, rendering the environment at each step. A small pop-up window displays the cart-pole via the render() function. Along with this, we decide the current action based on the observation resulting from the agent’s previous actions by calling the basic_policy() function.

Next we use the step() function, which returns four values: observation, an object (the observation of the environment); reward, a float (the amount of reward received for the previous action); done, a Boolean (whether the episode has terminated); and info, a dictionary (diagnostic information used for debugging and for learning about the environment). So in this loop, at each timestep, the agent chooses an action and the environment returns an observation and a reward.

Lastly, once the done variable is True, we break out of the loop and append the episode_rewards value to the totals list.
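The loop described in Steps 3 to 5 can be sketched as below. So that the sketch runs without Gym installed, it uses a hypothetical stand-in class, ToyPoleEnv, which only mimics the four-value step() interface described above and does not simulate real physics. The number of episodes is also an assumption. With a classic Gym release you would pass gym.make("CartPole-v1") instead and call env.render() inside the inner loop; newer Gym and Gymnasium releases return five values from step() and an (observation, info) pair from reset(), so the unpacking would need adjusting there.

```python
class ToyPoleEnv:
    """Hypothetical stand-in mimicking the classic 4-value Gym step() API."""

    def __init__(self, max_steps=20):
        self.max_steps = max_steps
        self.steps = 0

    def reset(self):
        self.steps = 0
        return [0.0, 0.0, 0.01, 0.0]  # fake initial observation

    def step(self, action):
        self.steps += 1
        # Fake dynamics: the pole tips away from the push direction.
        obs = [0.0, 0.0, 0.01 if action == 0 else -0.01, 0.0]
        done = self.steps >= self.max_steps
        return obs, 1.0, done, {}  # observation, reward, done, info


def basic_policy(obs):
    return 0 if obs[2] < 0 else 1  # push toward the leaning side


def run_episodes(env, policy, n_episodes=5, max_steps=1000):
    totals = []  # one total reward per episode
    for _ in range(n_episodes):
        episode_rewards = 0.0
        obs = env.reset()  # each episode starts from a fresh observation
        for _ in range(max_steps):
            action = policy(obs)
            obs, reward, done, info = env.step(action)
            episode_rewards += reward
            if done:  # episode over: stop stepping
                break
        totals.append(episode_rewards)
    return totals


totals = run_episodes(ToyPoleEnv(), basic_policy)
print(totals, max(totals))
```

Passing the environment and the policy in as arguments keeps the loop itself unchanged when you swap the toy environment for a real one.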

Step 6:

Finally, we print the totals list, which holds the rewards collected in each episode, along with the maximum reward obtained. Most importantly, we use the close() function to close the pop-up window; otherwise, the program may crash.

The code of this implementation can be found here

Ways Used For Learning

To implement Reinforcement learning we need some predefined method of learning, i.e. a way for the agent to work out which action should be taken to maximize the rewards.

For this reason, we have two methods of learning, which are as follows:

Monte Carlo

In this method, the agent completes the episode (i.e. reaches a “terminal state”) and then looks at the total rewards to see how well it has performed. In the Monte Carlo method, the rewards are collected at the end of the episode, and the maximum expected future reward is then calculated on the basis of the result.

For example: In terms of the dog example, the agent (the dog) using the Monte Carlo approach completes the action of bringing the thrown ball back and only then analyzes the rewards it received. On the basis of those rewards, the dog decides which actions to perform in the near future to maximize the reward.
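One common way to turn this idea into code is an every-visit Monte Carlo estimate of state values, sketched below. The function name monte_carlo_update, the discount gamma and the episode contents are assumptions for illustration. The key point is that nothing is updated until the whole episode is available.

```python
from collections import defaultdict


def monte_carlo_update(values, counts, episode, gamma=0.9):
    """Every-visit Monte Carlo update, applied after the episode has ended.

    episode is a list of (state, reward) pairs, where reward is what the
    agent received after acting in that state.
    """
    g = 0.0
    # Walk backwards so g is always the return from the current state on.
    for state, reward in reversed(episode):
        g = reward + gamma * g
        counts[state] += 1
        # Running mean of all returns observed from this state.
        values[state] += (g - values[state]) / counts[state]


values, counts = defaultdict(float), defaultdict(int)
# Hypothetical episode: the food only arrives at the very end.
episode = [("ball_thrown", 0.0), ("running", 0.0), ("ball_in_mouth", 1.0)]
monte_carlo_update(values, counts, episode)
print(dict(values))
```

Even though only the last step was rewarded, the earlier states receive credit too, discounted by how far they were from the reward.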

Temporal Difference Learning (TD Learning)

In the TD Learning method, the rewards obtained are analyzed after each step, and the maximum expected future reward is calculated on that basis alone. Therefore, after each step, the agent decides which action should be taken to get maximum rewards.

For example: Again using the dog example, here the dog looks for appreciation after each step. If it starts running after the ball and sees the owner’s approval, the dog will expect to get the reward. Similarly, if the dog is sitting and not going after the ball, the owner’s scolding helps the dog understand and makes it change its action.
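The simplest form of this idea is usually called TD(0). The sketch below shows one such update for state values; the function name td0_update, the values of alpha and gamma, and the state names are assumptions for illustration. Unlike the Monte Carlo approach, the update happens after a single step and leans on the current estimate of the next state's value.

```python
from collections import defaultdict


def td0_update(values, state, reward, next_state, done,
               alpha=0.1, gamma=0.9):
    """TD(0): nudge V(state) toward reward + gamma * V(next_state)."""
    # A terminal next state contributes no future value.
    bootstrap = 0.0 if done else values[next_state]
    td_error = reward + gamma * bootstrap - values[state]
    values[state] += alpha * td_error
    return td_error


values = defaultdict(float)  # unseen states start at 0.0
# Hypothetical step: the dog is praised as soon as it starts running.
td0_update(values, "running", reward=1.0, next_state="ball_in_mouth",
           done=False)
print(values["running"])  # 0.1
```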

The Disadvantage of Reinforcement Learning

In any reinforcement learning problem the agent tries to build an optimal policy, but it faces the dilemma of exploring new states while also maximizing its rewards. This dilemma is known as the Exploration vs. Exploitation Trade-off.

To be precise, Exploration is finding out more about the environment and discovering new actions which can be taken to get more rewards, whereas Exploitation is using known information to maximize the rewards.

In our dog example, suppose the owner does not scold the dog for not bringing the ball and the dog is very lazy. Whenever the owner throws the ball, the dog will not leave its place, since it is getting its reward in the form of rest. It keeps on resting, which is analogous to Exploitation. But if the dog tries bringing the ball back, it discovers that it receives food as a reward. This is Exploration, since the dog explored a new action to get a new reward.

This drawback arises because the agent in most cases memorizes one path and never tries to explore any others. We want the agent not only to keep exploiting known paths but also to keep searching for new ones. The balance is set by a hyper-parameter which determines how much exploration and how much exploitation is done.
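A common way to implement such a hyper-parameter is the epsilon-greedy rule, sketched below. This is one technique among several rather than the only option; the function name epsilon_greedy and the Q-values for the lazy dog are made up for the example.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, else the best-known one."""
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)           # explore
    return max(actions, key=q_values.get)    # exploit


# Hypothetical learned values: fetching pays better than resting.
q_values = {"rest": 0.2, "fetch": 1.0}
print(epsilon_greedy(q_values, epsilon=0.0))  # fetch (pure exploitation)
print(epsilon_greedy(q_values, epsilon=0.3))  # usually fetch, sometimes rest
```

Setting epsilon to 0 gives an agent that never explores, while setting it to 1 gives one that acts completely at random; in practice it is often started high and reduced over time.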

Applications of Reinforcement Learning Around Us

We have already discussed that Reinforcement learning is well suited to cases where information about the particular task/environment is limited. So now let’s look at some such applications:

Playing Games like Go/Chess
AlphaGo, as mentioned earlier, is a computer program that registered a victory against one of the best players in the world. AlphaGo used RL to decide which move to make based on current inputs and actions.

Robot Control
Robots have learned to walk, run, dance, fly, play various sports and perform mundane tasks using RL.

Online Advertising
With reinforcement learning, the way advertisements are delivered has changed completely. Users now see ads at the right time, matched to their history and interests. Applications of RL here include cross-channel marketing optimization and real-time bidding systems.

Dialogue Generation
A conversational agent chooses a sentence with an eye to the future, i.e. the long-term reward, which keeps both speakers more involved in the conversation.

Education and Training
Numerous online educational platforms are looking to incorporate RL into their tutoring systems and personalized learning. With RL, students get the advantage of study material matched to their learning capability.

Health and Medicine
RL produces results by looking at similar problems and how they were dealt with. In this way, RL suggests optimal treatment policies for patients.

Finance
RL has been used for various financial tasks, such as stock prediction based on the past and present performance of stocks. Many companies are trying to bring Reinforcement learning into their operations by building systems for trade execution.

Real world implementation of Reinforcement Learning
To get a deeper insight into how reinforcement learning is implemented, have a look at the following links:

  1. Reinforcement learning for Stock Prediction
  2. Reinforcement learning for meal-planning
  3. Reinforcement learning for Sports Betting
