# Application of Reinforcement Learning

In this article, we will learn about reinforcement learning. We are born into the world knowing little about our surroundings. Over the course of our lifetime, we slowly gain an understanding of the world through interaction: by interacting with the world, we learn about cause and effect, and how the world responds to our actions.

Once we understand how the world works, we can use that knowledge to accomplish specific goals. This learning from interaction is known as reinforcement learning.

*Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.* (Source: Wikipedia)

In the field of reinforcement learning, we refer to the learner or decision maker as the agent. The set of conditions it is placed in is referred to as the environment, and the feedback given to the learner is termed the reward.

The agent performs actions in the environment, and each action can have a positive or negative effect, as judged by the interpreter. Based on how good the action was, the interpreter gives the agent a positive or negative reward and moves it to the next state. The goal of the agent is to maximize the total reward: it remembers its past actions and acts so as to maximize the cumulative reward.

Let’s think of the agent as a small puppy born into the world without any understanding of how anything works. Now say that the owner communicates to the puppy how he would like it to behave. The puppy hears the command and, based on that observation, is expected to choose how to respond. Of course, it has a vested interest in responding appropriately to get a treat (which is the reward).

But it doesn’t yet know what any of its actions do or what effect they will have on the world, so it has to try them out and see what happens. At this point, it has no reason to favor any particular response to the owner’s command, so it chooses an action at random. After taking the action, it waits for feedback from its owner.

If it does what it was commanded to do, it receives a reward in the form of a treat. If it doesn’t, it receives a negative reward in the form of a scolding. In general, its aim is to get as many treats as possible from the owner. It may take the puppy some time to work out what is happening, but it should be able to figure it out eventually. The same situation applies to a reinforcement learning agent.

It interacts with the environment and eventually figures out how to gain maximum reward. Along the way, it must balance exploring new actions to learn their effects against exploiting its limited knowledge of what is already known to work well.

**Applications of Reinforcement Learning**

The applications of reinforcement learning are numerous and diverse, ranging from self-driving cars to board games. One of the major breakthroughs in machine learning in the 1990s was TD-Gammon, an algorithm that used RL to play backgammon. More recently, the RL-trained agent AlphaGo was able to beat professional players at Go, another very complicated game.

Jumping to a completely different domain, RL is also used in robotics. For instance, it is used to teach robots to walk. RL is successfully used in self-driving cars, ships, and airplanes. It is even used in finance, biology, telecommunications, and various other fields.

**MDPs and One-Step Dynamics**

Markov Decision Processes are used to rigorously define an RL problem.

- The **state space** S is the set of all (*non-terminal*) states. In episodic tasks, we use S+ to refer to the set of all states, including terminal states.
- The **action space** A is the set of possible actions. (Alternatively, A(*s*) refers to the set of actions available in state *s* ∈ S.)
- The **return at time step** *t* is Gₜ = Rₜ₊₁ + Rₜ₊₂ + Rₜ₊₃ + …. The agent selects actions with the goal of maximizing the expected (discounted) return.
- The **one-step dynamics** of the environment determine how the environment decides the next state and reward at every time step.

A **(finite) Markov Decision Process (MDP)** is defined by:

- a (finite) set of states S (or S+, in the case of an episodic task)
- a (finite) set of actions A
- a set of rewards R
- the one-step dynamics of the environment
- the discount rate *γ* ∈ [0, 1]

The **discounted return** at time step *t* is Gₜ = Rₜ₊₁ + γRₜ₊₂ + γ²Rₜ₊₃ + …. The discount rate *γ* is something you set to refine the goal you have for the agent. It must satisfy 0 ≤ *γ* ≤ 1. If *γ* = 0, the agent only cares about the most immediate reward. If *γ* = 1, the return is not discounted. Larger values of *γ* make the agent care more about the distant future, while smaller values result in more extreme discounting, where, in the most extreme case, the agent only cares about the most immediate reward.
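To make the effect of *γ* concrete, here is a minimal Python sketch of computing a discounted return; the reward sequence is invented purely for illustration.

```python
# Minimal sketch: the discounted return G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + …
# The reward sequence below is made up for illustration.

def discounted_return(rewards, gamma):
    """Sum the rewards, discounting each step further into the future by gamma."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0]                      # R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, gamma=1.0))   # undiscounted return: 3.0
print(discounted_return(rewards, gamma=0.0))   # only the immediate reward: 1.0
print(discounted_return(rewards, gamma=0.5))   # 1.0 + 0.0 + 0.25·2.0 = 1.5
```

Note how *γ* = 0 keeps only the immediate reward, while *γ* = 1 weighs all future rewards equally.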

Let’s understand MDPs with the help of an example. Consider a recycling robot that runs on a battery. It picks up cans scattered around the room when it has sufficient charge. Whenever the battery is low, the robot is supposed to go to its docking station to recharge. There are various states and actions possible for the robot.

If its battery is high, it can search the room for cans or wait if the room is already clean. If its battery is low, it has to recharge. If it keeps functioning in spite of a low battery, it can halt at any point, and human intervention would be required to carry it to its docking station. Based on this, we can define an MDP for the robot as follows:

**States:** {HIGH, LOW}

**Actions:** {SEARCH, RECHARGE, WAIT}

**Rewards:** {positive for searching in the HIGH state, positive for recharging in the LOW state, neutral for waiting, negative for searching in the LOW state}

**One-step dynamics:** learned from the evaluation of rewards over time

**Discount rate:** *γ* ∈ [0, 1]
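One way to see these pieces fitting together is to encode the robot’s one-step dynamics as a table. The transition probabilities and reward values below are illustrative assumptions, not part of the original specification.

```python
# Hypothetical one-step dynamics for the recycling robot:
# mdp[state][action] -> list of (probability, next_state, reward) tuples.
# All numbers here are illustrative assumptions.
mdp = {
    "HIGH": {
        "SEARCH": [(0.7, "HIGH", 4.0), (0.3, "LOW", 4.0)],  # searching may drain the battery
        "WAIT":   [(1.0, "HIGH", 1.0)],
    },
    "LOW": {
        "SEARCH":   [(0.6, "LOW", 4.0), (0.4, "HIGH", -20.0)],  # risk of halting: big penalty
        "WAIT":     [(1.0, "LOW", 1.0)],
        "RECHARGE": [(1.0, "HIGH", 0.0)],
    },
}

# Sanity check: each action's outcomes form a probability distribution.
for state, actions in mdp.items():
    for action, outcomes in actions.items():
        assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-9
```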

**An Approach to Handling RL Problems**

The agent looks for the best policy to maximize the reward.

A **deterministic policy** is a mapping *π*: S → A. For each state *s* ∈ S, it yields the action *a* ∈ A that the agent will choose while in state *s*.

A **stochastic policy** is a mapping *π*: S × A → [0, 1]. For each state *s* ∈ S and action *a* ∈ A, it yields the probability *π*(*a*∣*s*) that the agent chooses action *a* while in state *s*.

In the recycling-robot example, a deterministic policy might say that when the robot’s battery is LOW it recharges, and when it is HIGH it searches or waits. A stochastic policy instead gives the probability that the robot recharges, searches, or waits in each state.
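In code, the distinction is simply the shape of the mapping. The action choices and probabilities below are illustrative, not the robot’s true optimal behavior.

```python
# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"HIGH": "SEARCH", "LOW": "RECHARGE"}

# Stochastic policy: each state maps to a probability distribution over actions.
# The probabilities are illustrative assumptions.
stochastic_policy = {
    "HIGH": {"SEARCH": 0.9, "WAIT": 0.1},
    "LOW":  {"RECHARGE": 0.8, "WAIT": 0.1, "SEARCH": 0.1},
}

print(deterministic_policy["LOW"])           # RECHARGE
print(stochastic_policy["LOW"]["RECHARGE"])  # 0.8
```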

**State-Value Functions**

The state-value function for a policy *π* is denoted *vπ*. For each state *s* ∈ S, it yields the expected return if the agent starts in state *s* and then uses the policy to choose its actions for all time steps.

That is, *vπ*(*s*) ≐ E*π*[Gₜ ∣ Sₜ = *s*]. We refer to *vπ*(*s*) as the value of state *s* under policy *π*. E*π*[⋅] is defined as the expected value of a random variable, given that the agent follows policy *π*.
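For small problems, *vπ* can be computed by iterative policy evaluation, repeatedly applying the Bellman expectation equation. Here is a sketch on a tiny two-state MDP invented for illustration.

```python
# Tiny made-up MDP: dynamics[state][action] -> [(prob, next_state, reward)].
dynamics = {
    "A": {"go": [(1.0, "B", 1.0)]},
    "B": {"go": [(1.0, "A", 0.0)]},
}
policy = {"A": "go", "B": "go"}  # a deterministic policy π
gamma = 0.5

# Iterative policy evaluation: sweep the Bellman expectation update until values settle.
v = {s: 0.0 for s in dynamics}
for _ in range(100):
    v = {
        s: sum(p * (r + gamma * v[s2]) for p, s2, r in dynamics[s][policy[s]])
        for s in dynamics
    }

print(round(v["A"], 3))  # v_π(A) = 1/(1 - γ²) ≈ 1.333
print(round(v["B"], 3))  # v_π(B) = γ/(1 - γ²) ≈ 0.667
```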

**Optimality**

A policy *π*′ is defined to be better than or equal to a policy *π* if and only if *vπ′*(*s*) ≥ *vπ*(*s*) for all *s* ∈ S. An **optimal policy** *π*∗ satisfies *π*∗ ≥ *π* for all policies *π*. An optimal policy is guaranteed to exist but may not be unique. All optimal policies have the same state-value function *v*∗, called the optimal state-value function.

The action-value function for a policy *π* is denoted *qπ*. For each state *s* ∈ S and action *a* ∈ A, it yields the expected return if the agent starts in state *s*, takes action *a*, and then follows the policy for all future time steps. We refer to *qπ*(*s*, *a*) as the value of taking action *a* in state *s* under policy *π*. All optimal policies have the same action-value function *q*∗, called the optimal action-value function.

For example, suppose a person has to travel from point A to point B. There can be several paths between the two points, but only one takes the shortest time. When the traveler takes the minimum time, we can say that he has followed an optimal policy by choosing the best path possible.

Two main classes of algorithms are used to solve RL problems: Monte Carlo methods and temporal-difference methods.

**Monte Carlo Methods**

**MC Prediction: State Values**

Algorithms that solve the **prediction problem** determine the value function *vπ* (or *qπ*) corresponding to a policy *π*. Methods that evaluate a policy *π* from interaction with the environment fall into one of two categories:

- **On-policy** methods have the agent interact with the environment by following the same policy *π* that it seeks to evaluate (or improve).
- **Off-policy** methods have the agent interact with the environment by following a policy *b* (where *b* ≠ *π*) that is different from the policy it seeks to evaluate (or improve).

Each occurrence of state *s* ∈ S in an episode is called a **visit to** *s*.

There are two types of Monte Carlo (MC) prediction methods (for estimating *vπ*):

**First-visit MC** estimates *vπ*(*s*) as the average of the returns following *only first* visits to *s* (that is, it ignores returns associated with later visits).

**Every-visit MC** estimates *vπ*(*s*) as the average of the returns following *all* visits to *s*.
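A minimal sketch of first-visit MC prediction from a batch of pre-collected episodes; the episodes themselves are invented for illustration.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate v_pi from episodes given as lists of (state, reward) pairs."""
    returns = defaultdict(list)
    for episode in episodes:
        # Compute the return following each time step, working backwards.
        g, gs = 0.0, [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            g = episode[t][1] + gamma * g
            gs[t] = g
        # Record the return only for the FIRST visit to each state.
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state not in seen:
                seen.add(state)
                returns[state].append(gs[t])
    return {s: sum(rs) / len(rs) for s, rs in returns.items()}

episodes = [
    [("A", 1.0), ("B", 0.0), ("A", 2.0)],  # A is visited twice; only t=0 counts
    [("B", 3.0)],
]
print(first_visit_mc(episodes))  # {'A': 3.0, 'B': 2.5}
```

Every-visit MC would be the same sketch with the `seen` check removed, so that every visit contributes a return to the average.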

**MC Prediction: Action Values**

There are two types of MC prediction methods for estimating *qπ*:

**First-visit MC** estimates *qπ*(*s*, *a*) as the average of the returns following *only first* visits to the pair (*s*, *a*) (that is, it ignores returns associated with later visits).

**Every-visit MC** estimates *qπ*(*s*, *a*) as the average of the returns following *all* visits to the pair (*s*, *a*).

**Generalized Policy Iteration**

Algorithms designed to solve the **control problem** determine the optimal policy *π*∗ from interaction with the environment. **Generalized policy iteration (GPI)** refers to the general method of using alternating rounds of policy evaluation and improvement in the search for an optimal policy.

**MC Control: Policy Improvement**

A policy is **greedy** with respect to an action-value function estimate *Q* if, for every state *s* ∈ S, it selects an action *a* ∈ A(*s*) that maximizes *Q*(*s*, *a*). It is common to refer to the selected action as the **greedy action**. A policy is *ϵ***-greedy** with respect to an action-value function estimate *Q* if, for every state *s* ∈ S,

- with probability 1 − *ϵ*, the agent selects the greedy action, and
- with probability *ϵ*, the agent selects an action (uniformly) at random.
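An ϵ-greedy selection rule fits in a few lines; the Q-table below is an illustrative assumption for a single made-up state.

```python
import random

def epsilon_greedy(Q, state, actions, eps):
    """With probability eps explore uniformly; otherwise pick the greedy action."""
    if random.random() < eps:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(state, a)])  # exploit: greedy action

# Illustrative action-value estimates for one hypothetical state.
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.8}
print(epsilon_greedy(Q, "s0", ["left", "right"], eps=0.0))  # always greedy: right
```

With `eps=0.0` the policy is fully greedy; increasing `eps` trades exploitation for exploration.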

In order for MC control to converge to the optimal policy, the **Greedy in the Limit with Infinite Exploration (GLIE)** conditions must be met:

- every state-action pair is visited infinitely many times, and
- the policy converges to a policy that is greedy with respect to the action-value function estimate *Q*.

**Temporal Difference Methods**

**TD Prediction: TD(0)**

Whereas Monte Carlo (MC) prediction methods must wait until the end of an episode to update the value function estimate, temporal-difference (TD) methods update the value function after every time step.

For any fixed policy, **one-step TD** (or **TD(0)**) is guaranteed to converge to the true state-value function, as long as the step-size parameter *α* is sufficiently small.

In practice, TD prediction converges faster than MC prediction.
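The TD(0) update itself is a single line; the transition and value estimates below are illustrative.

```python
# TD(0) update: V(s) <- V(s) + α·[r + γ·V(s') - V(s)]

def td0_update(V, s, r, s_next, alpha, gamma):
    td_target = r + gamma * V[s_next]   # bootstrapped one-step target
    V[s] += alpha * (td_target - V[s])  # move V(s) a step toward the target
    return V

V = {"A": 0.0, "B": 1.0}                # illustrative value estimates
td0_update(V, s="A", r=0.5, s_next="B", alpha=0.1, gamma=0.9)
print(round(V["A"], 3))                 # 0.0 + 0.1·(0.5 + 0.9·1.0 − 0.0) = 0.14
```

Unlike an MC update, this uses the current estimate V(s′) instead of waiting for the episode’s actual return.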

**TD Control: Sarsa(0)**

**Sarsa(0)** (or **Sarsa**) is an on-policy TD control method. It is guaranteed to converge to the optimal action-value function *q*∗, as long as the step-size parameter *α* is sufficiently small and *ϵ* is chosen to satisfy the **Greedy in the Limit with Infinite Exploration (GLIE)** conditions.
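The update rule that gives Sarsa its name uses the (S, A, R, S′, A′) tuple; the state names and values below are illustrative.

```python
# Sarsa(0) update: Q(s, a) <- Q(s, a) + α·[r + γ·Q(s', a') - Q(s, a)]

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    td_target = r + gamma * Q[(s_next, a_next)]  # uses the action actually taken next
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q

Q = {("s0", "go"): 0.0, ("s1", "go"): 2.0}       # illustrative estimates
sarsa_update(Q, "s0", "go", r=1.0, s_next="s1", a_next="go", alpha=0.5, gamma=0.5)
print(Q[("s0", "go")])                           # 0.0 + 0.5·(1.0 + 0.5·2.0 − 0.0) = 1.0
```

Because the target uses the next action the policy actually selects, Sarsa is on-policy; Q-learning would instead use the maximum over next actions.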

**Analyzing Performance**

- On-policy TD control methods (like Sarsa) have better online performance than off-policy TD control methods (like Q-learning).
- Expected Sarsa generally achieves better performance than Sarsa.

The reinforcement learning methods above work with discrete state and action spaces. In practice, however, we often have to deal with continuous spaces, which is where these tabular methods break down. To overcome this shortcoming, deep reinforcement learning is gaining momentum.

**Deep Reinforcement Learning**

Deep reinforcement learning (DRL) is an exciting area of AI research, with potential applicability to a variety of problem areas. Some see DRL as a path to artificial general intelligence, or AGI, because of how it mirrors human learning by exploring and receiving feedback from environments. (Source: VentureBeat)

*It is basically RL applied with neural networks.*

Let’s take an example: a computer being taught to play the famous Mario game. Consider a particular instant in time. What is the required action at this point: to jump or to run? To decide, the current game frame is fed into a convolutional neural network (CNN), which draws on similar instances seen in the past and their corresponding rewards. The model then chooses the action expected to yield the maximum reward.

DRL solves this problem by making reinforcement learning algorithms work in continuous spaces. It is a relatively new concept, and research in this area is ongoing.