The post Beginner’s Guide to Reinforcement Learning appeared first on StepUp Analytics.

- What is Reinforcement Learning in simple words?
- The components of a Reinforcement Learning problem.
- Distinguishing Reinforcement Learning from Supervised and Unsupervised Learning.
- Algorithms used for implementing RL.
- Practical implementation of Reinforcement Learning.
- Methods used for learning.
- The disadvantages of Reinforcement Learning.
- Applications of Reinforcement Learning around us.
- Real-world implementations of Reinforcement Learning.

Reinforcement Learning is learning the best actions on the basis of rewards and punishments. But when we wear our technical goggles, Reinforcement Learning is defined using three basic concepts: states, actions, and rewards.

Here the “**state**” defines a situation in which an agent is present. The agent performs some “**actions**”, and based upon these actions it receives either rewards or punishments.

Consider the example of the dog: there we have the owner and the “**dog**” (the **agent**) itself. When the owner is present in the garden with the dog and throws away a ball, this throwing of the ball is the “**state**” for the **agent**, and the **dog** running after the ball is the “**action**”.

The result will be appreciation or food for the dog from the owner, which is the “**reward**” for the action; if the dog does not go after the ball and chooses some alternate action instead, it may get some “**punishment**”. This is what Reinforcement Learning is all about. Next, we’ll understand the terminology that Reinforcement Learning comprises.

For every Reinforcement Learning problem, there are some predefined components which help in better representation and understanding of the problem. The components are the following:

**Agent**: The agent takes actions; as mentioned earlier in our example, the dog is the **agent**.

**Action (A)**: The agent has a set of actions **A** from which it selects which action to perform, just like the dog deciding whether to go after the ball, just look at the ball, or jump in place.

**Discount Factor:** The **discount factor** is multiplied with future rewards to dampen their effect on the agent’s choice of action. To simplify, through the **discount factor** we make future rewards less valuable than immediate rewards, which makes the agent focus on short-term goals. So the smaller the value of the discount factor, the more insignificant future rewards become, and vice versa.
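To see this effect numerically, here is a small sketch (not from the original article) that computes a discounted sum of rewards for two values of the discount factor:

```python
# Discounted return: each future reward is scaled by gamma once per step,
# so rewards further in the future count for less.
def discounted_return(rewards, gamma):
    return sum(r * gamma ** t for t, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0]                       # same reward at three steps
low_gamma = discounted_return(rewards, 0.1)     # future rewards nearly ignored
high_gamma = discounted_return(rewards, 0.9)    # future rewards still matter
```

With gamma = 0.1 the three-step return is 1 + 0.1 + 0.01 = 1.11, while with gamma = 0.9 it is 1 + 0.9 + 0.81 = 2.71, so a smaller discount factor makes future rewards more insignificant.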

**Environment:** This is the surroundings of the agent in which it moves. In the dog example, the **environment** consists of the owner and the garden in which the dog is present. It is the **environment** which gives the agent its rewards as an output, based upon the agent’s current state and action as inputs.

**State:** A state is the immediate situation in which the agent finds itself in relation to other important things in the surroundings, like tools, obstacles, enemies and prizes/rewards. In the dog example, the moment the ball is thrown is a state.

**Reward (R):** The **reward** is the output received by the agent in response to its actions. For example, the dog receives **dog food** as a **reward** if it (the **agent**) brings back the ball, and receives scolding as a **punishment** if it does not.

**Policy:** The policy is the strategy the agent uses to determine which actions should be taken on the basis of the current state. Basically, the policy maps states to actions, i.e. it picks the actions providing the maximum rewards for the given states. In the dog example, once the dog comes to know that dog food will be given as a reward for bringing back the ball, it will form its own policy to reap maximum rewards.

**Markov Decision Processes (MDPs)** are mathematical frameworks to describe an environment in reinforcement learning, and almost all RL problems can be formalized using MDPs.

Basically, **MDPs** consist of a set of finite environment states S, a set of possible actions A(s) in each state, a real-valued reward function R(s), and a transition model.

All those who possess some basic knowledge of Artificial Intelligence would be well aware of the terms Supervised and Unsupervised learning. Similarly, Reinforcement Learning has been a buzzword in the field of AI and its implementations have gained huge popularity. **For example**, the very famous **AlphaGo**, developed using Reinforcement Learning by **Google DeepMind**, went on to defeat **Lee Sedol**, the World Champion of the game **Go**.

Now you must be wondering why supervised learning or unsupervised learning was not used. So let’s look at the areas where Reinforcement learning is better as compared to the other two methods.

Supervised Learning gets its name from the use of an external supervisor who is aware of the environment and shares that knowledge with the agent for accomplishing the task. In general, supervised learning is learning from tasks which have already been completed, with the agent obtaining its experience from them. But in some cases there are no completed tasks from which experience can be gained, and thus we cannot have a supervisor.

Since the game of Go has move counts in the billions, we cannot create a knowledge repository, and thus the only option left is playing more and more games to gain experience and extract knowledge from it.

So in both supervised learning and reinforcement learning we are mapping between input and output, but in reinforcement learning the reward function acts as the feedback or experience, in contrast to the labeled examples of supervised learning.

In unsupervised learning, there is no concept of mapping between input and output, unlike reinforcement learning. In unsupervised learning, our main aim is to find hidden patterns. For example, most recommendation systems, such as movie or news-article recommenders, use unsupervised learning: they build a knowledge graph on the basis of the constant feedback which the customer provides by liking particular movies/articles, and then similar items are recommended.


Reinforcement Learning, along with its fundamental concepts, needs to be implemented practically, and for that we use the following algorithms. Let’s have a look at them:

Q-learning is the most used reinforcement learning algorithm. Using this algorithm, the agent learns the quality (the **Q value**) of each action, and hence a **policy**, based on how much reward the environment returns.

Q-learning uses a table (the Q-table) to store a Q value for each of the environment’s states and each action.

SARSA resembles Q-learning to a large extent. The only difference between the two is that SARSA learns the Q value based on the action actually performed by the current policy, whereas Q-learning updates using the greedy policy (the maximum Q value of the next state).
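The difference can be made concrete with tabular update rules. The sketch below is illustrative: the learning rate (alpha) and discount factor (gamma) are assumed hyperparameters, not values from the article.

```python
# Tabular Q-learning vs. SARSA updates on a toy two-state problem.
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    # Off-policy: bootstrap from the greedy (maximum) Q value in s_next.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    # On-policy: bootstrap from the action the current policy actually took.
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

actions = ["left", "right"]
Q = {(s, a): 0.0 for s in [0, 1] for a in actions}
q_learning_update(Q, 0, "right", 1.0, 1, actions)   # Q[(0, "right")] -> 0.5
sarsa_update(Q, 0, "left", 1.0, 1, "left")          # Q[(0, "left")]  -> 0.5
```

The two updates only differ when the next action chosen by the policy is not the greedy one, for example when the policy occasionally explores.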

Now we will have a look at one of the basic implementations of Reinforcement Learning using the **OpenAI Gym library**. Gym is a toolkit for developing and comparing reinforcement learning algorithms.

Gym provides us with a variety of test problems, i.e. **environments**, all of which can be used to learn more about our reinforcement learning algorithms.

Before starting to work with Gym, we need to install gym using pip:

pip install gym

Or directly from an IPython notebook:

!pip install gym

After this, we are ready to start.

**Step 1:**

First, we import the gym library which we installed earlier. Then we use one of Gym’s built-in environments, **CartPole**. This environment displays a pole trying to balance on a cart which moves left and right.

**Step 2:**

Here we have created a function which decides the action the agent should take on the basis of the state. Our main aim is that the pole on the cart should stay balanced and not fall down, so if the pole leans past a given angle we return 0 as the action, and 1 otherwise.
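A plausible sketch of such a function, assuming the standard CartPole observation layout [cart position, cart velocity, pole angle, pole angular velocity] and actions 0 (push left) / 1 (push right):

```python
def basic_policy(obs):
    # obs is assumed to be [cart position, cart velocity,
    # pole angle, pole angular velocity]; action 0 pushes the
    # cart left and action 1 pushes it right.
    angle = obs[2]
    return 0 if angle < 0 else 1

leaning_left = [0.0, 0.0, -0.05, 0.0]    # pole tilted left  -> push left
leaning_right = [0.0, 0.0, 0.05, 0.0]    # pole tilted right -> push right
```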

In this totals list, we store the total reward collected in each episode.

In this loop, we calculate the rewards obtained in each episode. Each episode starts with an observation value which has been reset using the reset() function.

Here we run an instance of the CartPole environment for up to 1000 timesteps, rendering the environment at each step: the render() function displays the cart-pole in a small popup window. Along with this, we decide the current action based upon the observation produced by the agent’s previous actions, by calling the basic_policy() function.

Next we use the step() function, which returns four values: **observation** (object type; the observation of the environment), **reward** (float type; the amount of reward received for the previous action), **done** (Boolean type; whether the episode has terminated or not), and **info** (dictionary type; information used for debugging and for learning about the environment). So in this loop, at each timestep the **agent chooses an action** and the **environment** returns an **observation** and a **reward**.

Lastly, once the done variable returns “true”, we break out of the inner loop and append the episode_rewards value to the totals list.

Finally, we print the totals list, which holds the reward values, and along with this we print the maximum reward obtained. Most importantly, we use the close() function to close the popup window; otherwise the program may crash.
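Putting these steps together, here is a runnable sketch of the whole loop. To keep it self-contained, a minimal stand-in environment mimicking Gym’s reset()/step()/close() interface is used; its dynamics are random placeholders, not real pole physics. With Gym installed, you would instead write env = gym.make("CartPole-v0").

```python
import random

class StubCartPole:
    """Minimal stand-in for Gym's CartPole interface (reset/step/close).
    The dynamics are random placeholders, not real pole physics."""
    def reset(self):
        self.steps = 0
        self.limit = random.randint(10, 30)   # placeholder episode length
        return [random.uniform(-0.05, 0.05) for _ in range(4)]

    def step(self, action):
        self.steps += 1
        obs = [random.uniform(-0.05, 0.05) for _ in range(4)]
        reward = 1.0                          # CartPole gives +1 per step
        done = self.steps >= self.limit
        return obs, reward, done, {}

    def close(self):
        pass

env = StubCartPole()
totals = []                                   # reward collected per episode
for episode in range(20):
    episode_rewards = 0
    obs = env.reset()                         # fresh state each episode
    for step in range(1000):                  # at most 1000 timesteps
        # simple policy: push the cart toward the side the pole leans to
        action = 0 if obs[2] < 0 else 1
        obs, reward, done, info = env.step(action)
        episode_rewards += reward
        if done:
            break
    totals.append(episode_rewards)
env.close()                                   # close the render window
print(totals)
print(max(totals))
```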

The code of this implementation can be found here

To implement Reinforcement Learning we need some predefined method of learning, i.e. a way for the agent to understand which action should be taken to maximize the rewards.

For this reason, we have two methods used for learning, which are as follows:

In this method, the agent completes the episode (i.e. reaches a “terminal state”) and then looks at the **total rewards to see how well it has performed**. Here in the Monte Carlo method, the **rewards collection is done at the end of the episode** and then on the basis of the result, **the maximum expected future reward is calculated**.

**For example**: In the **dog example**, the agent, i.e. the dog, using the **Monte Carlo approach**, completes the action of bringing the thrown ball back and then analyzes the rewards it received. On the basis of those rewards, the dog decides which actions to perform in the near future to maximize the reward.

In the TD Learning method, the rewards obtained are **analyzed after each step**, and on this basis the **maximum expected future reward** is calculated. Therefore, after each step, the agent decides which action should be taken to get maximum rewards.
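A minimal sketch of a TD(0)-style value update follows; the states, values and hyperparameters are hypothetical, chosen only to illustrate the per-step update:

```python
# TD(0)-style value update: after every step, move the estimate V[s]
# toward the one-step target r + gamma * V[s_next].
def td_update(V, s, r, s_next, alpha=0.5, gamma=0.9):
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

V = {"ball_thrown": 0.0, "ball_fetched": 10.0}
td_update(V, "ball_thrown", 0.0, "ball_fetched")   # V["ball_thrown"] -> 4.5
```

Unlike Monte Carlo, this update happens immediately, without waiting for the episode to end.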

**For example**: Again using the dog example, in this instance the dog looks for appreciation after each step: even if it just starts running after the ball and sees the owner appreciating it, the dog will anticipate the reward. Similarly, if the dog is sitting and not going after the ball, the owner’s scolding will help the dog understand and make it change its action.

In any reinforcement learning problem, the agent tries to build an optimal policy, but while doing so it faces the dilemma of exploring new states while maximizing its rewards. This dilemma is known as the **Exploration vs. Exploitation Trade-off**.

To be precise, **Exploration** is finding more information about the environment and discovering new actions which can be taken to get more rewards, whereas **Exploitation** is using known information to maximize the rewards.

**In our dog example**, let’s say the owner does not scold the dog for not bringing the ball, and the dog is very lazy. Whenever the owner throws the ball, the dog will not leave its place, since it is getting a reward in the form of rest, and it keeps on resting, which is analogous to **Exploitation**. But if the dog tries to bring the ball back and discovers that it receives food as a reward, that is **Exploration**, since the dog explored new actions to get new rewards.

This drawback arises because in most cases the agent **memorizes** one path and never tries to explore any others. We want the **agent not only to keep exploiting known paths but also to keep searching for new ones**; the balance is set by a **hyper-parameter** which determines how much exploration and how much exploitation is needed.
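The most common form of this hyper-parameter is epsilon in an epsilon-greedy policy: with probability epsilon the agent explores a random action, and otherwise it exploits the best known one. A small sketch (the Q values are made up):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    # With probability epsilon, explore a random action; otherwise
    # exploit the action with the highest known Q value.
    if random.random() < epsilon:
        return random.choice(actions)                     # explore
    return max(actions, key=lambda a: Q[(state, a)])      # exploit

Q = {("rest", "fetch"): 5.0, ("rest", "stay"): 1.0}       # made-up Q values
chosen = epsilon_greedy(Q, "rest", ["fetch", "stay"], epsilon=0.0)
```

With epsilon set to 0 the agent always exploits; raising epsilon makes it explore more often.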

We have already discussed that Reinforcement Learning is the best possible option where information about a particular task/environment is limited. So now let’s look at such applications:

**Playing Games like Go/Chess**
AlphaGo, as mentioned earlier, is a computer program that registered a victory against one of the best players in the world. AlphaGo used RL for deciding which move should be taken based on current inputs and actions.

**Robot Control**
Robots have learned to walk, run, dance, fly, play various sports and perform mundane tasks using RL.

**Online Advertising**
Using reinforcement learning, the way advertisements are broadcast has totally changed: users now view ads at the right time, as per their history and interests.

**Dialogue Generation**
A conversational agent chooses each sentence by looking ahead, i.e. at the long-term reward, making both speakers more involved in the conversation.

**Education and Training**
There are numerous online educational platforms looking to incorporate RL into their tutoring systems and personalized learning. With the use of RL, students will have the advantage of study material suited to their learning capability.

**Health and Medicine**
RL produces results by looking at similar problems and how they were dealt with. Through this, RL suggests optimal treatment policies for patients.

**Finance**
RL has been used to perform various financial tasks, like stock prediction on the basis of the past and present performance of stocks. Many companies are trying to bring reinforcement learning applications into their operations, for example systems for trade execution.

**Real World Implementation of Reinforcement Learning**
To get a deeper insight into how reinforcement learning is implemented, have a look at the following links:

- Reinforcement learning for Stock Prediction
- Reinforcement learning for meal-planning
- Reinforcement learning for Sports Betting


The post Ridge Regression and Its Application appeared first on StepUp Analytics.

The OLS function works quite well when some assumptions are fulfilled: a linear relationship, no autocorrelation, homoscedasticity, more observations than variables, normal distribution of the residuals, and no or little multicollinearity.

But in many real-life scenarios, these assumptions are violated. In those cases, we need to find alternative approaches to provide solutions. Penalized/Regularized regression techniques such as ridge, lasso and elastic net regression work very well in these cases. In this article, I have tried to explain the ridge regression technique which is a way of creating regression models when the number of predictor variables of a dataset is more than the number of observations or when the data suffers from multicollinearity (independent variables are highly correlated).

**Regularization** methods provide a means to control our regression coefficients, which can help to reduce the variance and decrease the sampling error. Ridge regression belongs to a class of regression tools that use L2 regularization. L2 regularization works as a small addition to the OLS function that weights the residuals in a particular way to make the parameters more stable. The L2 penalty parameter, which equals the square of the magnitude of coefficients, is given by,
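In standard notation, with coefficients β₁, …, β_p and a tuning constant λ, the L2 penalty can be written as:

$$\text{L2 penalty} = \lambda \sum_{j=1}^{p} \beta_j^{2}$$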

And the regression function is given by,
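In standard notation (yᵢ the responses, x_{ij} the predictors, β_j the coefficients), the ridge coefficients minimize the residual sum of squares plus the L2 penalty:

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2} + \lambda\sum_{j=1}^{p}\beta_j^{2}$$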

The amount of the penalty can be fine-tuned using a constant called lambda (λ). Selecting a good value for λ is critical. When λ = 0, the penalty term has no effect and ridge regression produces the classical least squares coefficients. As λ → ∞, the impact of the penalty grows and all coefficients are shrunk towards zero. The ideal penalty is therefore somewhere between 0 and ∞.

In this way, ridge regression puts constraints on the magnitude of the coefficients, helps to reduce their magnitude and fluctuations, and progressively shrinks them towards zero. This helps to reduce the variance of the model. The outcome is typically a model that fits the training data less well than OLS but generalizes better, because it is less sensitive to extreme variance in the data, such as outliers.

Note that, in contrast to the ordinary least square regression, ridge regression is highly affected by the scale of the predictors. Therefore, it is better to standardize (i.e., scale) the predictors before applying the ridge regression so that all the predictors are on the same scale.
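As a quick numerical illustration of this shrinkage (a Python sketch with made-up data, separate from the R workflow below), a single-predictor ridge fit without an intercept has the closed form β = Σxᵢyᵢ / (Σxᵢ² + λ):

```python
# Single-predictor ridge (no intercept): beta = sum(x*y) / (sum(x^2) + lam).
# As lambda grows, the coefficient shrinks toward zero. Data is made up.
def ridge_coef(x, y, lam):
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 8.1]
beta_ols = ridge_coef(x, y, 0.0)      # lambda = 0: ordinary least squares
beta_ridge = ridge_coef(x, y, 10.0)   # larger lambda: shrunken coefficient
```

Here beta_ridge is smaller in magnitude than beta_ols, which is exactly the shrinkage described above.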

**Advantages and Disadvantages Of Ridge Regression**

- Least squares regression doesn’t differentiate “important” from “less-important” predictors in a model, so it includes all of them. This leads to overfitting a model and failure to find unique solutions. **Ridge regression avoids these problems.**

- Ridge regression works in part because it doesn’t require unbiased estimators; while least squares produces unbiased estimates, their variances can be so large that they may be wholly inaccurate.

- Ridge regression adds just enough bias to make the estimates reasonably reliable approximations to true population values.

- One important advantage of the ridge regression is that it still performs well, compared to the ordinary least square method in a situation where you have a large multivariate data with the number of predictors (p) larger than the number of observations (n).

- The ridge estimator is especially good at improving the least-squares estimate when multicollinearity is present.

- Ridge regression includes all the predictors in the final model, unlike stepwise regression methods, which generally select models that involve a reduced set of variables.

- A ridge model does not perform feature selection. If greater interpretability is necessary, where we need to reduce the signal in our data to a smaller subset, then a lasso model may be preferable.

- Ridge regression shrinks the coefficients towards zero, but it will not set any of them exactly to zero. The lasso regression is an alternative that overcomes this drawback.

Here I have given the link of a website below, where you can get the mathematical and geometric interpretation of Ridge regression **More Info**

Loading the MASS package to get the data set (the Boston housing data, with medv as the response):

**library(MASS)**

**data <- Boston**

Splitting the dataset in training and testing data

**train <- data[1:400,]**

**test <- data[401:nrow(data),]**

Loading libraries required for Ridge regression

**library(tidyverse)**

**library(glmnet)**

We need to know about the **glmnet** package:

- The glmnet package provides the function glmnet() for ridge regression. Rather than accepting a formula and data frame, it requires a response vector and a matrix of predictors.

- We must specify alpha = 0 for ridge regression (for lasso, alpha = 1; for elastic net, 0 <= alpha <= 1).

- Ridge regression also involves tuning a hyperparameter lambda (λ), discussed earlier.

- In the case of classification, i.e. the penalized logistic regression method, we specify **family = "binomial"**.

For more details about this package: **More Info**

There is another function, **lm.ridge()**, in the **MASS** package which can also be used. Please see the link below for more details about the function. **More Info** [Page Number: 79]

Preparing the training data set for training the regression model

**x.train <- model.matrix(medv ~ ., train)[,-1]**

We save the response variable, housing price, in the vector **y.train**:

**y.train <- train$medv**

We find the best value of lambda for the given data set with the function **cv.glmnet()**:

**cv <- cv.glmnet(x.train, y.train, alpha = 0)**

Displaying the best lambda value

**cv$lambda.min**

We fit the final model on the training data by adding the best lambda value.

**model_ridge <- glmnet(x.train, y.train, alpha = 0, lambda = cv$lambda.min)**

Displaying the regression coefficients below

**coef(model_ridge)**

Preparing the test data set to be used as a data matrix and discarding the intercept for predicting the values of the response variable.

**x.test <- model.matrix(medv ~ ., test)[,-1]**

We save the predicted values of the response variable, housing price, in the vector **prediction_ridge**:

**prediction_ridge <- predict(model_ridge, newx = x.test)**

Saving the RMSE, SSE and MAPE values of the predictions on the test data set in **Accuracy_ridge**.

Now we fit the multiple linear regression model on the training data set. First we check the variable names:

**names(train)**

**model_lm <- lm(medv ~ ., data = train)**

From the summary of the model, we can find the p-values of the individual predictor variables and decide which variables should be kept in the model.

**summary(model_lm)**

We need to check multicollinearity with the help of the function **vif()** from the **car** package:

**library(car)**

**vif(model_lm)**

We also need to exclude predictor variables with high VIF values to avoid multicollinearity, though we may allow multicollinearity up to a certain level.

**model_lm <- lm (medv ~ crim+zn+nox+rm+dis+rad+ptratio+lstat, data=train) **

Below I have mentioned the summary of the updated final model with all the significant variables and the vif values of the variables. The values of the R square and adjusted R square are pretty close, which also shows that the present predictor variables in the model are pretty significant.

**summary (model_lm)**

**vif (model_lm)**

We compute the prediction of the test data set with multiple linear regression which was trained using the training dataset

**prediction_lm <- predict(model_lm, test[,-14])**

We find the RMSE, SSE, and MAPE of the regression model and save them in **Accuracy_lm**.

We save the RMSE, SSE and MAPE values of both the linear and ridge regression models in **Accuracy**.

From the **Accuracy** comparison mentioned above, it is clear that even though the least squares estimates are unbiased, the accuracy of the model is compromised. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors. But with other models, like lasso and elastic net regression, we have the possibility of getting even better accuracy values.

This is because complicated models tend to overfit the training data. In my next article, I will introduce you to lasso and elastic net regression and explain the comparative advantage of using these models over multiple linear or ridge regression models.



The post Supervised vs Unsupervised Machine Learning appeared first on StepUp Analytics.

So what is required for creating such machine learning systems? The following things are required:

- **Data** – Input data is required for predicting the output.
- **Algorithms** – Machine Learning depends on certain statistical algorithms to determine data patterns.
- **Automation** – The ability to make systems operate automatically.
- **Iteration** – The complete process is iterative, i.e. a repetition of the process.
- **Scalability** – The capacity of the machine can be increased or decreased in size and scale.
- **Modeling** – Models are created according to demand by the process of modeling.

Machine Learning methods are classified into certain categories. These are:

**Supervised Learning** – In this method, input and output are provided to the computer along with feedback during training. The accuracy of the computer’s predictions during training is also analyzed. The main goal of this training is to make the computer learn how to map input to output.

**Unsupervised Learning** – In this case, no such training is provided, leaving the computer to find the output on its own. Unsupervised learning is mostly applied to transactional data and is used in more complex tasks. It uses another iterative approach, known as deep learning, to arrive at some conclusions.

**Reinforcement Learning** – This type of learning uses three components, namely agent, environment, and action. An agent is the one that perceives its surroundings; an environment is the one with which the agent interacts and in which it acts. The main goal in reinforcement learning is to find the best possible policy.

Machine learning makes use of processes similar to those of data mining. Machine learning algorithms are described in terms of a target function (f) that maps an input variable (x) to an output variable (y). This can be represented as:

**y=f(x)**

There is also an error e which is independent of the input variable x. Thus the more generalized form of the equation is:

**y=f(x) + e**

In machine learning, the mapping from x to y is done to make predictions. This method, known as predictive modeling, aims to make the most accurate predictions possible. There are various assumptions about the form of this function.

These days almost everything depends on machine learning, so let’s find out its benefits.

- **Decision making is faster** – Machine learning provides the best possible outcomes by prioritizing routine decision-making processes.
- **Adaptability** – Machine learning provides the ability to adapt to a rapidly changing environment, since data is being constantly updated.
- **Innovation** – Machine learning uses advanced algorithms that improve overall decision-making capacity. This helps in developing innovative business services and models.
- **Insight** – Machine learning helps in understanding unique data patterns, on the basis of which specific actions can be taken.
- **Business growth** – With machine learning, the overall business process and workflow will be faster, contributing to overall business growth and acceleration.
- **The outcome will be good** – With machine learning, the quality of the outcome improves, with fewer chances of error.


The post Application of Reinforcement Learning appeared first on StepUp Analytics.

Once we have an understanding of how the world works, we can use that knowledge to accomplish specific goals. This learning from interaction is known as reinforcement learning.

*Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.* (Source: Wikipedia)

In the field of reinforcement learning, we refer to the learner or decision maker as the agent. The set of conditions it is provided with is referred to as the environment. The response to the learner is termed the reward.

The agent performs certain actions in the environment, which can have a positive or negative effect, as decided by the interpreter. The interpreter, based on the efficiency of the action, provides positive or negative rewards to the agent and takes it to the next stage. The goal of the agent is to maximize the total positive reward. It remembers its actions from the past and acts accordingly so as to maximize the total reward.

Let’s think of the agent as a small puppy born into the world without any understanding of how anything works. Now say that the owner communicates to the puppy how he would like it to behave. The puppy hears the command and, based on that observation, is expected to choose how to respond. Of course, it has a vested interest in responding appropriately to get a treat (which is the reward).

But it doesn’t know what any of the actions do yet, or what effect they will have on the world, so it has to try them out and see what happens. At this point, it has no reason to favor any action in response to the owner’s command, so it chooses an action at random. After taking the action, it waits for a response. In response to its action, it receives feedback from its owner.

If it does what it was commanded to do, it receives a reward in the form of a treat. But if it doesn’t, it receives a negative reward in the form of scolding. In general, its aim is to get the maximum number of treats from the owner. It may take the puppy some time to get an idea of what is happening, but it should be able to figure it out eventually. The same situation happens with a reinforcement learning agent.

It interacts with the environment and eventually figures out how to gain maximum rewards. The agent explores all potential hypotheses to choose actions for maximizing the rewards, rather than exploiting limited knowledge about what is already known and should work well.

The applications of reinforcement learning are numerous and diverse, ranging from self-driving cars to board games. One of the major breakthroughs in machine learning in the 90s was TD-Gammon, an algorithm that used RL to play backgammon. More recently, the RL-trained agent AlphaGo was able to beat professionals at Go, another very complicated game.

Jumping to a completely different domain, RL is also used in robotics. For instance, it is used to teach robots to walk. RL is successfully used in self-driving cars, ships, and airplanes. It is even used in finance, biology, telecommunication, and various other businesses.

**MDPs and One-Step Dynamics**
Markov Decision Processes are used to rigorously define an RL problem.

- The **state space** S is the set of all (*non-terminal*) states.
- In episodic tasks, we use S+ to refer to the set of all states, including terminal states.
- The **action space** A is the set of possible actions. (Alternatively, A(*s*) refers to the set of possible actions available in state *s* ∈ S.)
- The **return at time step** *t* is *Gt* = *Rt+1* + *Rt+2* + *Rt+3* + …
- The agent selects actions with the goal of maximizing the expected (discounted) return.
- The **one-step dynamics** of the environment determine how the environment decides the state and reward at every time step.

A **(finite) Markov Decision Process (MDP)** is defined by:

- a (finite) set of states S (or S+, in the case of an episodic task)
- a (finite) set of actions A
- a set of rewards R
- the one-step dynamics of the environment
- the discount rate *γ* ∈ [0,1]

The discounted return at time step *t* is *Gt* = *Rt+1* + *γRt+2* + *γ²Rt+3* + ….

The discount rate *γ* is something that you set to refine the goal that you have for the agent. It must satisfy 0 ≤ *γ* ≤ 1. If *γ* = 0, the agent only cares about the most immediate reward. If *γ* = 1, the return is not discounted. For larger values of *γ*, the agent cares more about the distant future. Smaller values of *γ* result in more extreme discounting, where, in the most extreme case, the agent only cares about the most immediate reward.

Let’s understand MDPs with help of an example. Consider a recycling robot that runs on battery. It picks up cans scattered all around the room when it has sufficient battery. Whenever the battery is low, the robot is supposed to go to its docking station to recharge itself. There are various states and actions possible for the robot.

If it has a high battery, it can clean the room, or wait if the room is already clean. If it has a low battery, it has to recharge itself. If it keeps functioning in spite of a low battery, it can halt at any point, and human intervention would be required to take it to its docking station. Based on these, we can define an MDP for this robot in the following fashion:

**States:** {HIGH, LOW}
**Actions:** {SEARCH, RECHARGE, WAIT}
**Reward:** a numerical reward associated with each state–action transition

The agent looks for the best policies to maximize the reward.

A **deterministic policy** is a mapping *π*: S → A. For each state *s* ∈ S, it yields the action *a* ∈ A that the agent will choose while in state *s*.

A **stochastic policy** is a mapping *π*: S × A → [0,1]. For each state *s* ∈ S and action *a* ∈ A, it yields the probability *π*(*a*∣*s*) that the agent chooses action *a* while in state *s*.

In the example of the recycling robot, a deterministic policy might say that in the LOW state the robot always recharges and in the HIGH state it always searches. A stochastic policy instead gives, for each state, the probability of each action: the probability that the robot recharges, searches, or waits when in the LOW or HIGH state.
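The two kinds of policy for the recycling robot can be sketched as plain Python mappings (the probabilities below are made up purely for illustration):

```python
import random

# Deterministic policy: state -> action (action choices are illustrative).
deterministic_policy = {"LOW": "RECHARGE", "HIGH": "SEARCH"}

# Stochastic policy: state -> {action: probability}. These probabilities
# are invented for the sketch; each state's probabilities sum to 1.
stochastic_policy = {
    "LOW":  {"RECHARGE": 0.8, "WAIT": 0.2},
    "HIGH": {"SEARCH": 0.7, "WAIT": 0.3},
}

def choose_action(policy, state, rng=random):
    """Sample an action; a deterministic policy maps straight to one."""
    if isinstance(policy[state], str):
        return policy[state]
    actions, probs = zip(*policy[state].items())
    return rng.choices(actions, weights=probs, k=1)[0]

print(choose_action(deterministic_policy, "LOW"))  # always RECHARGE
print(choose_action(stochastic_policy, "HIGH"))    # SEARCH or WAIT
```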

The state-value function for a policy *π* is denoted *vπ*. For each state *s* ∈ S, it yields the expected return if the agent starts in state *s* and then uses the policy to choose its actions for all time steps.

That is, *vπ*(*s*) ≐ E*π*[*Gₜ* ∣ *Sₜ* = *s*]. We refer to *vπ*(*s*) as the value of state *s* under policy *π*. E*π*[⋅] is defined as the expected value of a random variable, given that the agent follows policy *π*.

A policy *π*′ is defined to be better than or equal to a policy *π* if and only if *vπ*′(*s*) ≥ *vπ*(*s*) for all *s* ∈ S. An **optimal policy** *π*∗ satisfies *π*∗ ≥ *π* for all policies *π*. An optimal policy is guaranteed to exist but may not be unique. All optimal policies have the same state-value function *v*∗, called the optimal state-value function.

The action-value function for a policy *π* is denoted *qπ*. For each state *s* ∈ S and action *a* ∈ A, it yields the expected return if the agent starts in state *s*, takes action *a*, and then follows the policy for all future time steps. We refer to *qπ*(*s*, *a*) as the value of taking action *a* in state *s* under policy *π*. All optimal policies have the same action-value function *q*∗, called the optimal action-value function.

For example, suppose a person has to reach point B from point A. There can be several paths between the two points, but only one path takes the shortest time. When the person takes the minimum time to travel between the points, we can say that he has used an optimal policy by choosing the best path possible.

Two families of algorithms are mainly used in solving RL problems, namely Monte Carlo methods and temporal-difference methods.

Algorithms that solve the **prediction problem** determine the value function *vπ* (or* qπ* ) corresponding to a policy *π*. Methods that evaluate a policy *π* from interaction with the environment fall under one of two categories:

- **On-policy** methods have the agent interact with the environment by following the same policy *π* that it seeks to evaluate (or improve).
- **Off-policy** methods have the agent interact with the environment by following a policy *b* (where *b* ≠ *π*) that is different from the policy that it seeks to evaluate (or improve).

Each occurrence of state *s *∈ S in an episode is called a **visit to** *s*.

There are two types of Monte Carlo (MC) prediction methods (for estimating * vπ* ):

**First-visit MC** estimates *vπ*(*s*) as the average of the returns following *only first* visits to *s* (that is, it ignores returns that are associated with later visits).

**Every-visit MC** estimates* vπ* (*s*) as the average of the returns following *all* visits to *s*.

There are two types of MC prediction methods for estimating* qπ* :

- **First-visit MC** estimates *qπ*(*s*, *a*) as the average of the returns following *only first* visits to the pair (*s*, *a*) (that is, it ignores returns that are associated with later visits).
- **Every-visit MC** estimates *qπ*(*s*, *a*) as the average of the returns following *all* visits to the pair (*s*, *a*).
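A first-visit MC prediction pass over recorded episodes might look like the following sketch (the episode format, names, and toy data are assumptions for illustration):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """First-visit MC prediction of v_pi from complete episodes.

    Each episode is a list of (state, reward) pairs, where the reward is
    the one received after leaving that state. Returns are averaged over
    the first visit to each state only.
    """
    returns = defaultdict(list)
    for episode in episodes:
        # Record the earliest time step at which each state appears.
        first_visit = {}
        for t, (state, _) in enumerate(episode):
            if state not in first_visit:
                first_visit[state] = t
        # Compute the return G_t at every time step, walking backwards.
        g_at = [0.0] * (len(episode) + 1)
        for t in range(len(episode) - 1, -1, -1):
            g_at[t] = episode[t][1] + gamma * g_at[t + 1]
        # Only the first visit of each state contributes.
        for state, t in first_visit.items():
            returns[state].append(g_at[t])
    return {s: sum(v) / len(v) for s, v in returns.items()}

episodes = [[("A", 1.0), ("B", 0.0), ("A", 2.0)],
            [("B", 1.0)]]
print(first_visit_mc(episodes))  # {'A': 3.0, 'B': 1.5}
```

Every-visit MC would instead append the return at *every* occurrence of a state, not just the first.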

Algorithms designed to solve the **control problem** determine the optimal policy *π*∗ from interaction with the environment. **Generalized policy iteration (GPI)** refers to the general method of using alternating rounds of policy evaluation and improvement in the search for an optimal policy.

A policy is **greedy** with respect to an action-value function estimate *Q* if, for every state *s* ∈ S, it is guaranteed to select an action *a* ∈ A(*s*) such that *a* = argmax_{a∈A(s)} *Q*(*s*, *a*). It is common to refer to the selected action as the **greedy action**. A policy is *ϵ***-greedy** with respect to an action-value function estimate *Q* if for every state *s* ∈ S,

with probability 1-*ϵ*, the agent selects the greedy action, and

with probability *ϵ*, the agent selects an action (uniformly) at random.
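An ϵ-greedy action selection step can be sketched in a few lines of Python (the names and the toy Q values are illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick the greedy action w.p. 1 - epsilon, else uniformly at random.

    `q_values` maps actions to their current estimates Q(s, a) for a
    fixed state s.
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_values))       # explore
    return max(q_values, key=q_values.get)      # exploit (greedy action)

q = {"left": 0.1, "right": 0.9}
print(epsilon_greedy(q, epsilon=0.0))  # always the greedy action: right
```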

In order for MC control to converge to the optimal policy, the **Greedy in the Limit with Infinite Exploration (GLIE)** conditions must be met:

- every state–action pair is visited infinitely many times, and
- the policy converges to a policy that is greedy with respect to the action-value function estimate *Q*.

Whereas Monte Carlo (MC) prediction methods must wait until the end of an episode to update the value function estimate, temporal-difference (TD) methods update the value function after every time step.

For any fixed policy, **one-step TD** (or **TD(0)**) is guaranteed to converge to the true state-value function, as long as the step-size parameter *α* is sufficiently small.

In practice, TD prediction converges faster than MC prediction.

**Sarsa(0)** (or **Sarsa**) is an on-policy TD control method. It is guaranteed to converge to the optimal action-value function *q*∗, as long as the step-size parameter *α* is sufficiently small and *ϵ* is chosen to satisfy the **Greedy in the Limit with Infinite Exploration (GLIE)** conditions.

- On-policy TD control methods have better online performance than off-policy TD control methods (like Q-learning).
- Expected Sarsa generally achieves better performance than Sarsa.
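For concreteness, here is a sketch of a single Sarsa(0) backup, Q(s,a) ← Q(s,a) + α(r + γ·Q(s′,a′) − Q(s,a)), with a dictionary-backed Q table (the names and defaults are assumptions for illustration):

```python
def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=1.0):
    """One Sarsa(0) backup on a dict keyed by (state, action).

    Missing entries default to 0. Unlike MC, this updates after a single
    time step instead of waiting for the end of the episode.
    """
    old = q.get((state, action), 0.0)
    target = reward + gamma * q.get((next_state, next_action), 0.0)
    q[(state, action)] = old + alpha * (target - old)
    return q

q = {}
sarsa_update(q, "s0", "a0", reward=1.0, next_state="s1", next_action="a1")
print(q[("s0", "a0")])  # 0 + 0.1 * (1 + 0 - 0) = 0.1
```

Q-learning would replace Q(s′,a′) in the target with max over actions, and Expected Sarsa with the expectation under the current policy.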

The tabular reinforcement learning methods above work with discrete state and action spaces. But in practical life, we have continuous spaces to deal with, and this is where tabular methods break down. To overcome this shortcoming, Deep Reinforcement Learning is gaining momentum.

Deep reinforcement learning (DRL) is an exciting area of AI research, with potential applicability to a variety of problem areas. Some see DRL as a path to artificial general intelligence, or AGI, because of how it mirrors human learning by exploring and receiving feedback from environments. (Source: VentureBeat)

*It is basically RL applied with neural networks. *

Let’s take a look at the image below. A computer is being taught to play the famous Mario game. Consider a particular instant of time, as in the image: what is the required action at this point, to jump or to run? The model decides by feeding the current game frame into a convolutional neural network (CNN). The CNN provides the model with information about similar instances in the past and the corresponding rewards, and the model then chooses the action that will yield the maximum reward.

DRL solves the problem of discrete spaces by making reinforcement learning algorithms work in continuous spaces. It is a relatively new concept, and research in this area is ongoing.

The post Application of Reinforcement Learning appeared first on StepUp Analytics.

The post Classification Techniques On Life Expectancy Data appeared first on StepUp Analytics.

In this article, we will learn about Classification Techniques. We as humans have been blessed with the concept of classification. We classify everything from our closet, where all the jeans go under one rack and all the shirts go in another meant only for shirts, to the apps on our phones and the files on our computers, where we have separate folders for each kind of files or apps.

Now a more “data scientific” definition of classification is that it is a form of data analysis that extracts models describing important data classes; in other words, the task of predicting the value of a categorical variable (class or target). Basically, it is finding out which of a set of predefined categories a new observation belongs to. A very common example is email, where we wish to classify certain emails as spam and others as not spam. The machine is able to achieve this task by learning from training data whose classes are already known.

Classification algorithms can only be used when we have discrete labels as outputs. A situation like the above example, where emails are classified as spam or not and there are only two possible outcomes, is called binary classification.

Another type is multi-label classification, in which multiple labels may be assigned to one instance. This is mostly used for audio and video classification, text classification, sentiment classification in sentiment analysis, etc.

Anyway, that was the basics and the sort of prerequisite information required to move forward with this article.

In this article, we will classify countries by continent in the Life Expectancy DataSet; the Continent column will serve as the class label.

This is a very small dataset with 6 columns and 223 rows, one for each country. The columns are Rank, Country, Overall Life, Male Life, Female Life, and Continent.

To perform this classification we will use 5 different classification techniques and algorithms and calculate the precision and accuracy for each of the algorithms and compare them. The 5 classification algorithms are:

- **KNN** — The K-Nearest Neighbour algorithm uses similarity measures such as distance functions to classify new data points after going through training.
- **SVM** — Support Vector Machine is a supervised learning algorithm that creates a model assigning new points to one or the other category using the training set. The assignment can be linear or non-linear according to the problem.
- **OneR** — OneR is the One Rule algorithm; it generates one rule for each predictor in the data and then selects the rule with the smallest error as the answer. Even though this is a very simple algorithm, as it generates only one rule, it is known to perform better than some of the more complex classification algorithms.
- **RIPPER** — RIPPER is a rule-based learner that builds a set of rules identifying the classes while minimizing the error, defined as the number of training examples misclassified by the rules. It is a direct way of performing rule-based classification.
- **C4.5** — C4.5 is a statistical classifier that generates a decision tree. It builds the tree from the training data just as ID3 does, and at each node chooses the attribute that most effectively splits the set of samples into subsets enriched in one class or the other. This is, in a way, an indirect method of rule-based classification.

Let us now begin with the analysis using R Programming and let us see which classifier performs the best. We will use the following libraries/packages throughout the code: e1071, class, caret, rJava, RWeka.

```r
#loading libraries
library("e1071")
library(class)
library(caret)
library(rJava)
library(RWeka)
```

The first step of data preprocessing will involve the following:

- Importing the data set in R using the *read.csv()* function.
- Performing some visual descriptive analysis by looking at the data set and getting a summary of it using the *summary()* and *str()* functions.
- Converting the class label, Continent, to a categorical variable by factoring it.
- Removing irrelevant columns that won’t be used in the analysis, such as the first column, Rank.

```r
#importing csv file in R
dataset <- read.csv(file.choose())
#displaying head (first six) elements
head(dataset)
str(dataset)
#dimensions
dim(dataset)
#Converting Continent to factor
dataset[c("Continent")] <- lapply(dataset[c("Continent")], factor)
#removing the first (irrelevant) column
dataset <- dataset[,-1]
str(dataset)
summary(dataset)
```

head(), str(), dim() functions

str(), summary() functions after the removal of the first column and the factor conversion

Although the Continent column was already of factor data type, we still ran the command to make it a factor nevertheless. With this view of the data we can get a clear idea of how the data looks; the *head()* function enables that. The summary functions show us some vital descriptive information.

Most importantly, we can see how many countries lie in each continent, which will help us later while checking the accuracy. We can also observe the means of the overall, male and female life expectancy, which are **72.49**, **70.04** and **75.02** respectively. Medians, quartiles, min and max values can also be observed.

For the second part of the data pre-processing we will:

- Divide the dataset into training and test sets in an 80:20 ratio, using random sampling to generate a random permutation of training and test elements.
- Save the train and test samples in a list in the output variable.
- View the train and test samples by printing the output variable.

```r
#sampling 80% training data
traindata <- sample(seq_len(nrow(dataset)), size = floor(0.80 * nrow(dataset)))
data_train <- dataset[traindata, ]
data_test <- dataset[-traindata, ]
t_train <- dataset$Continent[traindata]
t_test <- dataset$Continent[-traindata]
output <- list(data_train, data_test, t_train, t_test)
#a view of the divided data (into train and test)
print(output)
```

KNN classification will be performed with the help of the preProcess and train methods available in the caret package. The tuneLength argument in the train method is set to 20 on the basis of the fit-model results; it helps us automatically select the best value of K.

In our case, K is chosen to be 5. Also, accuracy was used to select the optimal model using the largest value.

```r
#KNN
#setting seed
set.seed(12345)
knn_train_test <- output
let_train <- knn_train_test[[1]]
let_test <- knn_train_test[[2]]
#Preprocessing and training
trainX <- let_train[, names(let_train) != "Continent"]
preProcValues <- preProcess(x = trainX, method = c("center", "scale"))
print(preProcValues)
```

```r
#Fit Model - Using caret's train method to find the best k
ctrl <- trainControl(method = "repeatedcv", repeats = 3)
knnFit <- train(Continent ~ ., data = let_train, method = "knn",
                trControl = ctrl, preProcess = c("center", "scale"),
                tuneLength = 20)
print(knnFit)
plot(knnFit)
#Make predictions
knnPredict <- predict(knnFit, newdata = let_test)
knnPredict
#Confusion Matrix
confusionMatrix(knnPredict, let_test$Continent)
#Accuracy
knnoutput <- mean(knnPredict == let_test$Continent)
knnoutput
```

preprocessing using the caret package and finding the knn fit i.e. value of K

plot depicting the choice of the value of K by using accuracy

prediction, confusion matrix, and accuracy

- First and foremost, we observe that the best value of K has been chosen to be 5, based on the highest accuracy (by repeated cross-validation).
- The plot also shows the highest accuracy value of 0.562 at K = 5, with very close competition from K = 17 at an accuracy of 0.559.
- The **accuracy of KNN on the test set is 44%.**

The SVM classifier will be deployed with the help of the tune method from the e1071 package. The SVM fit will use a linear kernel, with cost 1 chosen by the tune method.

```r
#SVM
#setting seed
set.seed(12345)
train_test <- output
let_train <- train_test[[1]]
let_test <- train_test[[2]]
#Fit model
svmfit <- svm(Continent ~ ., data = let_train, kernel = "linear", scale = FALSE)
svmfit
```

```r
#Tune to check best performance
tuned <- tune(svm, Continent ~ ., data = let_train, kernel = "linear",
              ranges = list(cost = c(0.001, 0.01, 0.1, 1, 10, 100)))
summary(tuned)
#Make predictions
p <- predict(svmfit, let_test, type = "class")
length(let_test$Continent)
table(p, let_test$Continent)
#Analyse results
#Confusion matrix
confusionMatrix(p, let_test$Continent)
#Accuracy
svmoutput <- mean(p == let_test$Continent)
svmoutput
```

fitting the model using svmfit()

tuning to check for the best performance and predicting the classes

creating the confusion matrix and calculating the accuracy

- We have observed an **accuracy of 55% with SVM.**

The OneR classifier will be deployed with the help of the OneR method in the RWeka package.

```r
#OneR
#setting seed
set.seed(12345)
oner_train_test <- output
let_train <- oner_train_test[[1]]
let_test <- oner_train_test[[2]]
#Fitting model
model <- OneR(Continent ~ ., let_train)
model
#prediction
pred <- predict(model, let_test)
pred
table(pred, let_test$Continent)
summary(model)
#confusion matrix
confusionMatrix(pred, let_test$Continent)
#Accuracy
acc <- mean(pred == let_test$Continent)
acc
```

model 1/2

model 2/2

prediction() function, table, and summary of the model

confusion matrix and accuracy

- We observe the modelling of 178 instances, mapped using the training data.
- When predicting, there are only 38 correctly classified instances while 140 are wrongly classified, because Africa has been taken as the one rule.
- This makes the **accuracy only 20% with the OneR algorithm.**

The RIPPER classifier will be deployed with the help of the JRip method in the RWeka package.

```r
#RIPPER Algorithm
#setting seed
set.seed(12345)
ripper_train_test <- output
let_train <- ripper_train_test[[1]]
let_test <- ripper_train_test[[2]]
#fitting model using Weka control function of JRip
model1 <- JRip(Continent ~ ., data = let_train)
model1
#prediction
pred1 <- predict(model1, let_test)
pred1
table(pred1, let_test$Continent)
summary(model1)
#confusion matrix
confusionMatrix(pred1, let_test$Continent)
#Accuracy
acc <- mean(pred1 == let_test$Continent)
acc
```

modeling, prediction, tabulation of the prediction and summary

confusion matrix and accuracy

- When predicting, 95 instances are correctly classified while 83 are wrongly classified.
- The confusion matrix clearly shows which continent was classified as what.
- **48% accuracy** can be observed using the RIPPER algorithm.

The C4.5 classifier has been deployed with the help of the J48 method in the RWeka package.

```r
#C4.5 Algorithm
#setting seed
set.seed(12345)
c45_train_test <- output
let_train <- c45_train_test[[1]]
let_test <- c45_train_test[[2]]
#fit model - using Weka control function of J48
fit <- J48(Continent ~ ., data = let_train)
#summarize the fit
summary(fit)
#make predictions
c45predictions <- predict(fit, let_test)
#summarize accuracy
tb <- table(c45predictions, let_test$Continent)
#Confusion Matrix
confusionMatrix(c45predictions, let_test$Continent)
#Accuracy
c45output <- mean(c45predictions == let_test$Continent)
c45output
```

summary of the fit, confusion matrix, and accuracy

- By summarizing the fit we can see that 138 instances are correctly classified while 40 are wrongly classified.
- **The accuracy obtained is 48%** using the C4.5 algorithm.
- The accuracy is very similar to that of the RIPPER algorithm.

Finally, let us list out all the accuracy values from the various classifiers used in this article.

- KNN — 44%
- SVM — 55%
- OneR — 20%
- RIPPER — 48%
- C4.5 — 48%

Clearly, SVM has outperformed all the other classification techniques by a good margin. RIPPER and C4.5 were the closest, both at 48% accuracy. The OneR algorithm performed the worst, with only 20% accuracy.

The post Classification Techniques On Life Expectancy Data appeared first on StepUp Analytics.

The post Introduction To Machine Learning appeared first on StepUp Analytics.

Machine Learning is the science of getting computers to learn and act like humans do, and improve their learning over time in an autonomous fashion, by feeding them data and information in the form of observations. Machine learning is used in many domains, right from predicting if the next movie will be a success at the box office or not to the nuances of the stock market such as predicting the stock price.

Coming to the Actuarial profession, Machine learning has applications in pricing, reserving, product design, capital modeling to name a few. It has been introduced in CS2 as well, thus covering applications of concepts such as time series, Lee-Carter, pspline regression models using R.

Machine learning can be broadly classified into Supervised, Unsupervised and Reinforcement learning. Curriculum 2019 mainly focuses on Supervised and Unsupervised Machine learning.

Let’s have a look at what it is:

Supervised learning, as the name suggests, indicates the presence of a supervisor acting as a teacher. It is learning in which we teach or train the algorithm using data that is well labelled, meaning each example is already tagged with the correct answer (the training data). The algorithm involves a target/outcome variable (or dependent variable) which is to be predicted from a given set of predictors (independent variables). Using this set of variables, we generate a function that maps inputs to desired outputs. After that, the algorithm is provided with a new set of data so that it can analyse the data and produce an outcome based on what it learned from the labelled data.

Supervised learning has been broadly classified into regression and classification problems. Both problems have the goal of construction of a good model that can predict the value of the dependent variable from the independent variables. The difference between the two tasks is the fact that the dependent variable is numerical for regression and categorical for classification.

**Regression:**A regression problem is when the output variable is a real or continuous value, such as salary or weight. Regression predictive modeling is the task of approximating a mapping function (f) from input variables (X) to a continuous output variable (Y). For example, predicting the annual expenditure (dependent variable) of a person by using his annual income as the independent variable.

**Classification:**A classification problem is when the output variable is a category, such as yes or no, black or white. A classification model attempts to draw some conclusions from observed values. Given one or more inputs a classification model will try to predict the value of one or more outcomes. Classification models include Logistic regression, Decision tree, Random forest, Naive Bayes, to name a few. For example, predicting whether a person will default on his next loan payment on the basis of his income.
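A minimal scikit-learn sketch of the two tasks, using made-up income figures (the data and model choices are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy income data (in thousands); the values are invented for the sketch.
income = np.array([[20], [40], [60], [80]])

# Regression: a numeric target (annual expenditure).
expenditure = np.array([15, 28, 45, 60])
reg = LinearRegression().fit(income, expenditure)
print(reg.predict([[50]]))   # a continuous value

# Classification: a categorical target (will the person default?).
default = np.array([1, 1, 0, 0])  # 1 = default, 0 = no default
clf = LogisticRegression().fit(income, default)
print(clf.predict([[50]]))   # a class label, 0 or 1
```

The same predictor feeds both models; only the type of target, numeric versus categorical, decides which task it is.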

Many actuarial modeling projects such as insurance contract pricing, pension scheme valuation fall into the category of supervised learning.

Unlike supervised learning, no teacher is provided which means no training will be given to the machine. The information is neither classified nor labeled and the task of the machine is to group unsorted information according to similarities, patterns, and differences without any prior training of data.

In this algorithm, we do not have any target or outcome variable to predict. It is used for clustering population in different groups, which is widely used for segmenting variables under study in different groups. Unsupervised machine learning can be classified into two categories of algorithms:

**Clustering:** It is the task of grouping a set of objects in such a way that objects in the same group (a cluster) are more similar to each other than to those in other groups (clusters). A clustering problem is when you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior. Some examples are:

- given a class of buyers, cluster based on the buyer attributes
- given a set of tweets, cluster based on the content of the tweet

**Association:**An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y. For example, the rule found in the sales data of a supermarket would indicate that if a customer buys bread and butter together, they are likely to buy tomato ketchup. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placements.
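A tiny clustering sketch with scikit-learn's KMeans, grouping hypothetical customers by purchasing behaviour (the feature values are invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [monthly spend, visits per month].
customers = np.array([[500, 12], [480, 10], [520, 11],
                      [50, 1], [60, 2], [40, 1]])

# Discover two inherent groupings without any labels being provided.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(customers)
print(labels)  # e.g. frequent big spenders vs occasional buyers
```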

Though Machine Learning is a vast topic, identifying the correct technique according to your problem and the data and going ahead in a systematic way is the key.

By now you would have got a brief idea of what Machine Learning is. Here is a list of few resources which can help you in understanding the concept in detail from an actuarial perspective.

- https://www.actuaries.org.uk/documents/practical-application-machine-learning-within-actuarial-work
- https://www.actuaries.org.uk/documents/modelling-analytics-and-insights-data-maid-working-party-terms-reference

The post Introduction To Machine Learning appeared first on StepUp Analytics.

The post Learning From Imbalanced Dataset appeared first on StepUp Analytics.

A data set is called imbalanced if it contains many more samples from one class than from the rest of the classes. Data sets are unbalanced when at least one class is represented by only a small number of training examples (called the minority class) while the other classes make up the majority.

In this scenario, classifiers can have good accuracy on the majority class but very poor accuracy on the minority class(es) due to the influence that the larger majority class has on traditional training criteria.

For example, in the **medical diagnosis** of a certain cancer, if the cancer is regarded as the positive class and non-cancer (healthy) as negative, then missing a cancer (the patient is actually positive but is classified as negative, a false negative) is much more serious, and thus more expensive, than a false-positive error.

The patient could lose his/her life because of the delay in the correct diagnosis and treatment. Similarly, if carrying a bomb is positive, then it is much more expensive to miss a terrorist who carries a bomb onto a flight than to search an innocent person.

The unbalanced dataset problem appears in many real-world applications like text categorization, fault detection, fraud detection, oil-spills detection in satellite images, toxicology, cultural modeling, medical diagnosis.

Because of this unequal class distribution, the performance of the existing classifiers tends to be biased towards the majority class.

Proposed remedies include resizing training datasets, cost-sensitive classifiers, and the snowball method. Recently, several methods have been proposed with good performance on unbalanced data. These approaches include modified SVMs, k-nearest neighbours (KNN), neural networks, genetic programming, rough-set-based algorithms, probabilistic decision trees and related learning methods.

Easy data-level methods for balancing the classes consist of resampling the original data set, either by oversampling the minority class or by under-sampling the majority class, until the classes are approximately equally represented.

Both strategies can be applied in any learning system since they act as a preprocessing phase.

The simplest method to increase the size of the minority class corresponds to random over-sampling, that is, a nonheuristic method that balances the class distribution through the random replication of positive examples. Nevertheless, since this method replicates existing examples in the minority class, overfitting is more likely to occur.

**Two types of Over-Sampling**

**Random Oversampling:** This sampling balances the data by randomly oversampling the minority class.

**Informative oversampling:** Informative oversampling uses a pre-specified criterion and synthetically generates minority class observations.

Under-sampling is an efficient method for class-imbalance learning. This method uses a subset of the majority class to train the classifier. Since many majority-class examples are ignored, the training set becomes more balanced and the training process becomes faster. The most common preprocessing technique is random majority under-sampling (RUS). In RUS, instances of the majority class are randomly discarded from the dataset.

**Random Undersampling:** The random undersampling method *randomly* chooses observations from the majority class, which are eliminated until the data set gets balanced.

**Informative Undersampling:** Informative undersampling follows a pre-specified selection criterion to remove observations from the majority class.
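The two random (non-heuristic) resampling strategies, over and under, can be sketched in a few lines of Python (function and variable names are illustrative):

```python
import random

def rebalance(majority, minority, method="undersample", seed=0):
    """Randomly rebalance two classes (a non-heuristic, data-level sketch).

    undersample: discard majority examples down to the minority size.
    oversample:  replicate minority examples up to the majority size.
    """
    rng = random.Random(seed)
    if method == "undersample":
        return rng.sample(majority, len(minority)), list(minority)
    # Oversample with replacement, so minority examples get replicated
    # (which is why random oversampling risks overfitting).
    return list(majority), rng.choices(minority, k=len(majority))

maj = list(range(100))  # 100 majority examples
mino = list(range(5))   # 5 minority examples
a, b = rebalance(maj, mino, "undersample")
print(len(a), len(b))   # 5 5
a, b = rebalance(maj, mino, "oversample")
print(len(a), len(b))   # 100 100
```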

Here we have two better-known algorithms for informative undersampling: **BalanceCascade** and **EasyEnsemble**.

**BalanceCascade:** It takes a supervised learning approach, where it develops an ensemble of classifiers and systematically selects which majority-class examples to feed into the ensemble.

**EasyEnsemble:** At first, it extracts several independent subsets of samples (with replacement) from the majority class. Then, it develops multiple classifiers based on the combination of each subset with the minority class. As you see, it works just like an unsupervised learning algorithm.

However, the main drawback of under-sampling is that potentially useful information contained in these ignored examples is neglected.

Synthetic data generation handles imbalance by generating artificial data, rather than replicating and adding existing observations from the minority class. It is also a type of oversampling technique.

In synthetic data generation, we find** synthetic minority oversampling technique (SMOTE)** is a powerful and widely used method. SMOTE algorithm creates artificial data based on feature space (rather than data space) similarities from minority samples. We can also say, it generates a random set of minority class observations to shift the classifier learning bias towards minority class.

What SMOTE does is simple. First, it finds the n nearest neighbours in the minority class for each of the samples in that class. Then it draws lines between each sample and its neighbours and generates random points on those lines.

As the image above illustrates, SMOTE finds the 5 nearest neighbours of a sample point, draws a line to each of them, and then creates synthetic samples on those lines, labelled with the minority class.
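A stripped-down sketch of that idea follows (not a production SMOTE; real implementations, such as imbalanced-learn's SMOTE, do proper k-NN search in the full feature space, and all names and data here are illustrative):

```python
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate synthetic minority points on lines between neighbours.

    `minority` is a list of feature tuples. For each new point: pick a
    sample, find its k nearest neighbours, pick one, and interpolate a
    random point on the segment between them.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))[:k]
        nb = rng.choice(neighbours)
        t = rng.random()  # random position on the segment from x to nb
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=3, k=2)
print(new_points)  # three synthetic points inside the unit square
```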

In regular learning, we treat all misclassifications equally, which causes issues in imbalanced classification problems, as there is no extra reward for identifying the minority class over the majority class. Cost-sensitive learning changes this and uses a function **C(p, t) **(usually represented as a matrix) that specifies the cost of misclassifying an instance of class **t** as class **p**.

This allows us to penalize misclassifications of the minority class more heavily than we do with misclassifications of the majority class, in hopes that this increases the true positive rate. A common scheme for this is to have the cost equal to the inverse of the proportion of the data-set that the class makes up. This increases the penalization as the class size decreases.
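The inverse-proportion cost scheme described above can be computed directly (the class names and counts below are made up for illustration):

```python
def inverse_frequency_costs(class_counts):
    """Misclassification cost per class = inverse of its proportion.

    Smaller classes get larger costs, so the classifier is penalized
    more heavily for misclassifying the minority class.
    """
    total = sum(class_counts.values())
    return {c: total / n for c, n in class_counts.items()}

costs = inverse_frequency_costs({"non-fraud": 950, "fraud": 50})
print(costs)  # fraud costs 20x its base rate; non-fraud barely above 1x
```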

**SVM and Imbalanced Datasets**

The success of SVM is very limited when it is applied to the problem of learning from imbalanced datasets in which negative instances heavily outnumber the positive instances. Even though undersampling the majority class does improve SVM performance, there is an inherent loss of valuable information in this process.

**Practical implementation on Imbalanced Data Set**

Here, I will take the **“Credit card fraud Data Set”.**

The Credit Card Fraud Detection Problem includes modeling past credit card transactions with the knowledge of the ones that turned out to be fraud. This model is then used to identify whether a new transaction is fraudulent or not. Our aim here is to detect 100% of the fraudulent transactions while minimizing the incorrect fraud classifications.

**Observations**

- The data set is highly skewed, consisting of 492 frauds in a total of 284,807 observations. This resulted in only 0.172% fraud cases. This skewed set is justified by the low number of fraudulent transactions.
- The dataset consists of numerical values from the 28 ‘Principal Component Analysis (PCA)’ transformed features, namely V1 to V28. Furthermore, there is no metadata about the original features provided, so pre-analysis or feature study could not be done.
- The ‘Time’ and ‘Amount’ features are not transformed data.
- There is no missing value in the dataset.

**Inferences Drawn**

- Owing to such imbalance in the data, an algorithm that does no feature analysis and simply predicts every transaction as non-fraud will still achieve an accuracy of 99.828%. Therefore, accuracy is not a correct measure of efficiency in our case. We need some other standard of correctness for classifying transactions as fraud or non-fraud.
- The ‘Time’ feature does not indicate the actual time of the transaction and is more of a list of the data in chronological order. So we assume that the ‘Time’ feature has little or no significance in classifying a fraud transaction. Therefore, we eliminate this column from further analysis.
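The accuracy trap described above can be verified with simple arithmetic on the dataset's class counts:

```python
# Class counts from the Credit Card Fraud dataset.
total = 284_807
frauds = 492

# A "model" that predicts every transaction as non-fraud:
accuracy = (total - frauds) / total
recall = 0 / frauds  # it catches no fraud at all

print(f"accuracy: {accuracy:.3%}")   # ~99.827%
print(f"fraud recall: {recall:.0%}")  # 0%
```

A near-perfect accuracy with zero fraud recall is why metrics such as recall, precision, or AUC are the right yardsticks here.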

I have created a Jupyter notebook on credit card fraud detection using this imbalanced dataset.

For the complete solution, please visit my GitHub repo: Code-never ends

The post Learning From Imbalanced Dataset appeared first on StepUp Analytics.

The post Application of Linear Regression Via Python’s sklearn Library appeared first on StepUp Analytics.

Before wasting any further time, let’s dive in!

(If you are still confused between regression and classification tasks and would like StepUp Analytics to do an article on it, just give us the order and we’ll be happy to help you out!)

- Introduction to Linear Regression
- The main purpose and its use case (problem domains)
- Basic hypothesis/mathematics behind the design of algorithm.
- Application of Linear Regression on a dataset via Python’s sklearn library
- Summary

Linear regression is an approach for linearly mapping the relationship between a scalar output variable (the dependent variable) and one or more input variables (the independent variables). In some terminologies, the dependent variable is also referred to as the *target variable* and the independent variables as the *predictor variables*.

Now, you can infer from the number of variables involved that when the output is computed on the basis of a single input variable, the approach is known as **Simple Linear Regression**, and when two or more variables are accountable for the output, it is known as **Multiple Linear Regression**.

In this article, we will cover Simple Linear Regression to get a gist of what is actually happening behind the name.

As you may already know, a Machine Learning task is either classification (predicting a class/label) or regression (predicting a real-valued quantity). So, whenever you come across a regression task and you see a somewhat linear pattern between the input and the output variables, you may require the Linear Regression Model for your problem statement.

**The Core Purpose In Linear Regression Is To Obtain A Line That Best Fits The Data. **

Visualize the problem as a graph where every plotted point is a value of the variables at that instance. Each point (in blue) in the graph is known as a *data point*, and the fitted line (in red) is called the **Regression Line**.

So the task of the algorithm is to find the best-fit line, for which the actual values are as close to the line as possible. In other words, the task is to make the total prediction error as small as possible, where the error is the distance from each point to the regression line.

Typical problem statements covered by this model are:

- Predicting the weight of a person from their height (Simple Linear Regression)
- Predicting the price of a house on the basis of its age, number of bedrooms, square-foot area, etc. (Multiple Linear Regression)
- Predicting the height of a child on the basis of his father’s height (Simple Linear Regression)
- Predicting the market stock prices of a company based on its past performance, and so on

Keep in mind that the input variables can’t be categorical. In case you’re dealing with any, it will be better that you convert them into numerical values before feeding them into the model.

Since the beginning of this article, we have been using the word “*Linear*” which implicitly means a straight line. Just a little squeeze in your brain and the idea or basic algorithm behind the Linear Regression will automatically come to you. It will surely have the *straight line equation*, right? Let us see.

**Y = b₀ + b₁X₁ + b₂X₂ + b₃X₃ + …**

The above equation is used in the case of *Multiple Linear Regression* problems; for *Simple Linear Regression* it reduces to

**Y = b₀ + b₁X₁**

Here,

**Y** = output / target variable / response

**b₀** = bias coefficient (adds intercept-wise flexibility to the line)

**b₁** = coefficient/parameter related to X₁

**b** = coefficients/parameters learned and updated while training/fitting the model

Now, if you compare this equation with that of a straight line (y = mx + c), you can see the similarity between the two.
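For Simple Linear Regression, b₀ and b₁ also have a well-known closed-form least-squares solution: b₁ = Cov(X, Y) / Var(X) and b₀ = mean(Y) - b₁ · mean(X). A small sketch on made-up data that lies exactly on a line:

```python
import numpy as np

# Toy data lying exactly on the line y = 2x + 1
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = 2 * X + 1

# Least-squares estimates for Y = b0 + b1 * X
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
print(b0, b1)  # 1.0 2.0
```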

Until now, we discussed the hypothesis used in linear regression tasks; but after the regression line is fitted and an output is produced, we also need to evaluate it. We need to judge the model’s performance by comparing the actual outcome with the output the model predicted. And there comes the need for evaluation and optimization techniques.

Out of the many techniques currently used, the “*least squares*” method is the easiest for beginners and is mostly used with Simple Linear Regression. What it actually does is compute the distance between the actual value (plotted on the graph) and the value predicted by the model (i.e. the value on the regression line) and square it. The aim of the model is then to minimize this quantity as much as it can. There are various other optimization algorithms, like gradient descent, which will be covered explicitly in further articles.

Mathematically, it is implemented using the formula below:

**J = (1/n) Σᵢ (Predᵢ - Yᵢ)²**

Where **n** = number of instances (or number of examples)

**Predᵢ** = predicted value for instance i

**Yᵢ** = actual value for instance i

**J** = cost function (least squares in this case)
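Under the definitions above (n, Pred, Y, J), the least-squares cost can be sketched in a few lines; the sample predictions and actual values here are made up:

```python
import numpy as np

def least_squares_cost(pred, y):
    """J = (1/n) * sum((pred_i - y_i)^2), the mean of squared errors."""
    pred, y = np.asarray(pred, dtype=float), np.asarray(y, dtype=float)
    return np.mean((pred - y) ** 2)

# Two predictions, each off by 1 -> J = (1 + 1) / 2 = 1.0
print(least_squares_cost([2.0, 4.0], [1.0, 3.0]))  # 1.0
```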

The code to the notebook can be accessed via

https://gist.github.com/srajan-jha/edf6e4673da408151366aad6d4e1ef27

Sklearn is a Python library which has all the basic machine learning models already implemented. It also contains some classical datasets which you can play around with.

Sklearn makes the implementation of any machine learning model much easier; however, I’d suggest you also try to code the model from scratch. That way you’d get a clearer picture of the model.
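A minimal sketch of the sklearn workflow, using made-up data (exactly on the line y = 3x + 2) rather than the notebook's dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data following y = 3x + 2
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # sklearn expects a 2-D feature matrix
y = np.array([5.0, 8.0, 11.0, 14.0])

model = LinearRegression()
model.fit(X, y)  # learns b0 (intercept_) and b1 (coef_)

print(model.intercept_, model.coef_[0])  # ~2.0, ~3.0
print(model.predict([[5.0]])[0])         # ~17.0
```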

To get a detailed understanding of what actually is happening in the code, feel free to reach out to us in the comments below!

In this article, you read about what a Linear Model actually is and how it is divided into Simple and Multiple tasks depending upon the number of variables involved.

Moreover, we tried to understand the problem domains where this algorithm might stand out. Also, I tried to give you all a basic understanding of the mathematics involved in the name: Linear Regression. Then further we understood the need for optimization algorithms and also went through the Least Squares Method. The code implementation was also covered later which was done using SKLEARN Library. (Feel free to check out the Linear Regression Code from Sklearn).

You must know that Linear Regression is the most basic algorithm and it might not perform well on every dataset. You must look out for a linear relation between the inputs and outputs, and only if you think it exists should you move further.

At the end of the day, it all comes down to the accuracy of the model, so if you think Linear Regression isn’t performing well enough on your dataset, then you must know that this is just the beginning and there are a lot of models still waiting to be discovered by you! Stay tuned with us and get to know them all.


The post Model Selection Based on Cross Validation in R appeared first on StepUp Analytics.

In statistics, model selection based on cross-validation plays a vital role. The prediction problem is about predicting a response (either continuous or discrete) using a set of predictors (some of which may be continuous, others discrete). The solution to such a problem is to build a prediction model on a training sample. This process presents two (related) challenges:

We often have many candidate models (e.g. regression, tree, neural net, SVM, etc). Each model may have many sub-models specified by hyper-parameters that need to be tuned for optimal prediction performance (e.g. variable selection, shrinkage/penalty factor, smoothing/complexity parameter, Bayesian hyper-priors, etc.). In order to choose the (approximate) best model, we need to estimate objectively the performance of different models and their sub-models.

Once we decide on the best model, we want to estimate its test error by making a prediction on a new sample, which provides an objective judgment on the performance of this final model.

Besides the test error, other model assessment tools such as ROC Curve or Calibration Plot may be useful.

To estimate the test errors of different models and to assess the final model objectively, we shall ideally split the dataset into three parts

**Training Sample** It is used for model estimation, i.e. estimating the model parameters. It can be 50% of the data.

**Validation Sample** It is used for model selection, by estimating the test error of each candidate model. It can be 25% of the data.

**Test Sample** It is used for model assessment, by estimating the test error of the final chosen model. It can be the remaining 25% of the data.

If the data is insufficient to be split into three parts, we can drop the Test Sample if we can accept (slightly) biased model assessments. In addition, we can drop the Validation Sample and use CV (Cross Validation) or Bootstrap method on the Training Sample to estimate the test errors to do the model selection.
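The 50/25/25 split described above can be sketched by applying scikit-learn's `train_test_split` twice (the data here is synthetic):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)  # 100 synthetic observations, 2 features
y = np.arange(100)

# First cut off 50% for training, then split the remainder 50/50
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 50 25 25
```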

Basically, the CV or Bootstrap methods use the training sample efficiently by generating small internal validation samples. Alternatively, we can use analytical estimators such as AIC or BIC, which may not be available for some models. In general, my recommendation is to use 5-fold or 10-fold CV.

Because analytical model selection metrics such as AIC or BIC are not universally available for all models (e.g. trees, SVM), we usually use:

- Cross-Validation(CV)
- Bootstrap for model selection[1]

In general, Cross-Validation (CV) and Bootstrap have similar performance. However, the Bootstrap method (e.g. the “.632+” estimator) is generally more computationally intensive than CV. What’s more, the concept of CV is simple and easy to communicate.

**How CV Works** Consider a 10-fold CV. We would split the Training Sample into 10 roughly equal-sized parts. For the 1st part, we fit the model to the other 9 parts of the data and calculate the prediction error, **e1**, of the fitted model on this 1st part of the data; then we repeat this process on the 2nd part of the data, producing **e2**, and so on and so forth.

Note that in this way each observation is predicted exactly once, as an out-of-sample prediction. In the end, we will have 10 error estimates, **e1**, **e2**, . . . , **e10**, which can be averaged into one **e** (the CV error) and which can be used to compute the standard error of **e**.

Alternatively, if the computing resource is ample, then we can, for example, repeat the CV procedures, using different 10-fold partitions each time, for 20 times and estimate the standard error of the CV Error. In the model selection process, it would be very informative to plot out this CV Error, along with its standard error bar, to facilitate the model comparison.

Often we apply a one-standard error rule, in which we choose the most parsimonious model whose error is no more than one standard error above the error of the best model.
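A sketch of computing the 10 fold errors, their average, and the standard error, using scikit-learn's `KFold` on synthetic regression data (the document's examples are in R; this is the Python equivalent):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + rng.randn(100) * 0.1  # known signal + small noise

fold_errors = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    fit = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = fit.predict(X[test_idx])
    fold_errors.append(np.mean((pred - y[test_idx]) ** 2))  # e1 ... e10

cv_error = np.mean(fold_errors)                              # the CV error e
std_error = np.std(fold_errors, ddof=1) / np.sqrt(len(fold_errors))
print(f"CV error: {cv_error:.4f} +/- {std_error:.4f}")
```

Plotting `cv_error` with its `std_error` bar across candidate models is exactly what the one-standard-error rule needs.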

**Bias** If the training sample size is small, CV will in general overestimate the test error (because the model does not see enough data to show its full “strength”). This estimation bias will be negligible if we have a sufficient sample size. For CV-5, each training model uses only 80% of the Training Sample; for CV-10, 90%. Thus the bias of the CV-5 error is higher than that of the CV-10 error.

**Variance** For the same reason as above (CV-5: 80%; CV-10: 90%), the training models of CV-10 are more similar (correlated) to each other than those of CV-5. Thus the variance of the CV-10 error is higher than that of the CV-5 error.

**Computing Time** CV-10, which fits a model 10 times, will take about twice as long to run as CV-5, which fits a model 5 times.

Since it is generally hard to know the bias-variance trade-off between CV-5 and CV-10, we recommend using the computing time to make a choice: if a model takes a long time to fit, use CV-5; otherwise use CV-10.

**Examples of Model Selection and Assessment**

**Selection of a hyper-parameter** Consider the best subset regression of size p. We can choose the best size, **p_best**, by using CV-10 on the Training Sample; thus we may not need a Validation Sample here. Then we fit a new best subset regression of size **p_best** on the entire Training Sample. The fitted regression is then assessed using the Test Sample, yielding an estimate of the test error, say, the RMSE (Root Mean Squared Error).

When we are satisfied with this model, we can combine the Training, Validation, and Test Samples together and fit a new best subset regression of size **p_best** on the pooled data. This final model can then be deployed in production, where it is used for prediction on new data.

**Feature selection** Suppose we have a dataset of 100 observations and 1000 variables. Naturally, we want to select some important variables before modeling. Unless we are using unsupervised learning methods (e.g. PCA, ICA, etc.) to select/summarize the variables, any supervised variable-selection method, where knowledge of the response is utilized, must be carried out independently and repeatedly in each training set of the CV procedure.

In other words, CV is fair only if the variable selection is considered part of the prediction model.

**ROC curve** When the response is dichotomous, we often use two metrics in lieu of the test error:

**Sensitivity = 1 – (Misclassification Error when Response = TRUE)**

**Specificity = 1 – (Misclassification Error when Response = FALSE)**

There is often a key tuning parameter, called Cut-off, that dictates the trade-off between sensitivity and specificity. The trade-off can be illustrated by the ROC (Receiver Operating Characteristic) curve.
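A sketch of how the cut-off drives this trade-off, on a made-up set of predicted scores:

```python
import numpy as np

def sensitivity_specificity(scores, truth, cutoff):
    """Classify score >= cutoff as TRUE; return (sensitivity, specificity)."""
    pred = scores >= cutoff
    sensitivity = np.mean(pred[truth])    # true positive rate
    specificity = np.mean(~pred[~truth])  # true negative rate
    return sensitivity, specificity

scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.1])
truth = np.array([True, True, False, True, False, False])

# Lowering the cutoff raises sensitivity at the expense of specificity;
# sweeping the cutoff over all values traces out the ROC curve.
print(sensitivity_specificity(scores, truth, 0.75))
print(sensitivity_specificity(scores, truth, 0.35))
```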


The post Gradient Descent: Relating It With Real Life Analogies appeared first on StepUp Analytics.

Consider the case of self-driving cars. There is no one inside looking out for pedestrians and operating the car accordingly; rather, the model fitted in the car itself detects them and drives appropriately, pulling on the brakes, slowing down, or taking whatever action is needed. Now, imagine that the fitted model isn’t accurate; it won’t be able to watch out for pedestrians or fellow cars and would end up crashing, putting several lives at risk.

*How will we be able to evaluate our model? How can we judge whether it is performing according to our need or not?* This evaluation is done by the calculation of a *COST FUNCTION*, which is just a mapping function that will tell us the difference between the desired result and what our model is computing. Once a cost function is computed, it is *our *duty to correct the model whenever it does something undesirable.

Think of your model as your own child; whenever your child (model) does anything wrong, you’d correct him until the time comes that he can distinguish between the do’s and don’ts. In terms of the model, this time comes when the accuracy reaches what is actually required. This correction is done using *Optimization Algorithms*.

As the initial cost function for the model is computed, it is judged and hence optimized with the help of these Optimization Algorithms. This cycle of computing and optimizing the cost function goes on until the desired accuracy level is reached. So, optimization algorithms are helpers which minimize the cost function (or error rate) thus maximizing the accuracy. And talking about the Optimization Algorithms, *GRADIENT DESCENT* is the name which comes at the very first step of your long journey. What is it?

Gradient Descent is the basis for more powerful optimizing algorithms which are currently being used in Deep (as well as Machine) Learning. So, it is necessary to lay the foundation strong, right?

Gradient descent is an optimization algorithm which iteratively finds the values of learnable parameters of a function (f) to minimize the cost function (or error rate).

It is also known as a first-order iterative optimization technique, as it computes the first-order derivative of the cost function with respect to the learnable parameters (say, the weights).

**Relating Gradient Descent with real-life analogies:**

- Think of a valley which you want to descend. The game is that you’re blindfolded. What a sane human might do is, move a step and check for the slope of the valley i.e. whether it is going up or down. Then, proceed to follow the downward slope of the valley, and repeat the same step again and again until you reach the minima!
- Another highly-used analogy is, suppose you have a ball which is placed on an inclined plane. According to the laws, it will roll until it finds a gentle plane, where it will ultimately stop.

That exact situation happens in gradient descent: the inclined and/or irregular surface is the cost function when plotted. The job of gradient descent is to provide the direction and velocity of movement so as to reach the minima of the function, where the cost is least.

So now, hope you know the task from the analogies we just went through. It is to reach the minimum value of the cost function. Let us now see how it is actually accomplished and is implemented.

As the cost function is computed for random values of the learnable parameters (weights) in the very first step, it is then evaluated with the help of gradient descent. What happens at this stage is that a derivative of the cost function is calculated with respect to each learnable parameter. The sign of the derivative (whether positive or negative) decides in which direction the next step is taken while updating the parameters.

After computing the gradient /derivative, the parameters are updated as follows:

**PARAMETER = PARAMETER – (LEARNING_RATE * GRADIENT)**

- LEARNING_RATE (denoted by α) decides the size of the steps which the algorithm takes while descending the cost function. Choosing an appropriately small α is wise: if you choose a value that is too large, the function will keep overshooting and will not be able to converge.
- The GRADIENT is the derivative of the cost function with respect to the parameter being updated. From the above equation, if the sign of the gradient is negative, the value of the parameter is increased, so the function moves to the right of the slope (when initially on the left side of the minima). Similarly, if the sign is positive (i.e. the initial position is on the right of the minima), then the value of the parameter decreases, thus moving towards the minima on the left.
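The update rule above can be sketched on a one-parameter cost function, J(w) = (w - 3)², whose minimum is at w = 3 (the starting point and learning rate here are arbitrary choices):

```python
# Gradient of J(w) = (w - 3)^2 is dJ/dw = 2 * (w - 3)
w = 0.0           # starting point
learning_rate = 0.1

for _ in range(100):
    gradient = 2 * (w - 3)
    w = w - learning_rate * gradient  # PARAMETER = PARAMETER - (LEARNING_RATE * GRADIENT)

print(round(w, 4))  # converges to ~3.0
```

Because the gradient is negative to the left of the minimum, the update increases `w`, exactly as described above.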

On the basis of the number of instances (training examples) looked over before updating the parameters, standard Gradient Descent can be classified as:

- Batch Gradient Descent
- Stochastic Gradient Descent
- Mini-batch Gradient Descent

*Batch Gradient Descent* is the standard gradient descent technique, in which the algorithm calculates the gradient over the whole dataset and performs only one update per pass. Hence, for large datasets, it can be too time- as well as space-consuming.

*Stochastic Gradient Descent*, on the other hand, updates the parameters for **each training example**. It is hence a much faster technique. However, due to the frequent updates, fluctuations are noticed when the training of the model is plotted over time. These fluctuations can keep the algorithm overshooting instead of settling at the global minima.

In **Mini-batch Gradient Descent** (the most useful of the three), batches of n training examples are made. The algorithm computes the gradient over one batch at a time and performs an update for each. The batch size is decided by the programmer with the size of the data in mind. It has, however, been found in practice that it should be a power of 2 (e.g. 32, 64, 128) for fast and efficient computation.

To summarize, in this article we covered:

- Why evaluation of the model is necessary
- How model evaluation is done using a Cost Function
- The need for Optimization Algorithms
- Introduction to Gradient Descent
- The gist of derivatives, global minima, and slope finding
- How the updating process goes:
  1. Random initialization
  2. Cost (function) computation
  3. *Slope/Gradient/Derivative* finding using Gradient Descent
  4. Updating of parameters
  5. Repeat steps 2 to 4 until the global minimum is reached
- Types of Gradient Descent:
  - Batch GD or Standard GD
  - Stochastic GD
  - Mini-batch GD, which is preferred over the other two variants

**This Example Depicts The Accuracy And Cost Function Rate For The Three Variants Of Gradient Descent: i.e. Batch, Stochastic, Mini-Batch**

```python
### Importing datasets and important libraries ###
import numpy as np
import matplotlib.pyplot as plt
import math
import sklearn
import sklearn.datasets
from opt_utils import load_params_and_grads, initialize_parameters, forward_propagation, backward_propagation
from opt_utils import compute_cost, predict, predict_dec, plot_decision_boundary, load_dataset

### Setting up the frame size for plotting ###
%matplotlib inline
plt.rcParams['figure.figsize'] = (7.0, 4.0)  # setting default plot size
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
```

```python
def update_params_with_gd(params, grads, learning_rate):
    """
    Update parameters using one step of gradient descent

    Arguments:
    params -- python dictionary containing your parameters to be updated:
                params['W' + str(l)] = Wl
                params['b' + str(l)] = bl
    grads -- python dictionary containing your gradients to update each parameter:
                grads['dW' + str(l)] = dWl
                grads['db' + str(l)] = dbl
    learning_rate -- the learning rate, scalar.

    Returns:
    params -- python dictionary containing your updated parameters
    """
    L = len(params) // 2  # number of layers (each layer has a W and a b)
    for l in range(L):
        params['W' + str(l+1)] = params['W' + str(l+1)] - learning_rate * grads['dW' + str(l+1)]
        params['b' + str(l+1)] = params['b' + str(l+1)] - learning_rate * grads['db' + str(l+1)]
    return params
```

**Figure showing the different approach of Batch, Stochastic and Mini-batch Gradient Descents**

**Figure 1**: **SGD vs GD**

“+” denotes a minimum of the cost. SGD leads to many oscillations to reach convergence. But each step is a lot faster to compute for SGD than for GD, as it uses only one training example (vs. the whole batch for GD).

**Figure 2**: **SGD vs Mini-Batch GD**

“+” denotes a minimum of the cost. Using mini-batches in your optimization algorithm often leads to faster optimization.

```python
def random_mini_batches(X, Y, mini_batch_size, seed=0):
    """
    Creates a list of random minibatches from (X, Y)

    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    mini_batch_size -- size of the mini-batches, integer

    Returns:
    mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
    """
    np.random.seed(seed)
    m = X.shape[1]
    mini_batches = []

    # Shuffling
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1, m))

    # Partitioning into full-sized minibatches
    num_complete_minibatches = math.floor(m / mini_batch_size)
    for k in range(0, num_complete_minibatches):
        mini_batch_X = shuffled_X[:, k * mini_batch_size: (k+1) * mini_batch_size]
        mini_batch_Y = shuffled_Y[:, k * mini_batch_size: (k+1) * mini_batch_size]
        mini_batches.append((mini_batch_X, mini_batch_Y))

    # Last (possibly smaller) minibatch
    if m % mini_batch_size != 0:
        mini_batch_X = shuffled_X[:, num_complete_minibatches * mini_batch_size:]
        mini_batch_Y = shuffled_Y[:, num_complete_minibatches * mini_batch_size:]
        mini_batches.append((mini_batch_X, mini_batch_Y))

    return mini_batches
```

Loading the dataset:

```python
train_X, train_Y = load_dataset()
```

```python
def model(X, Y, layers_dims, optimizer, learning_rate=0.0007, num_epochs=10000, print_cost=True):
    """
    3-layer neural network model which can be run in different optimizer modes.

    Arguments:
    X -- input data, of shape (2, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    layers_dims -- python list, containing the size of each layer
    learning_rate -- the learning rate, scalar.
    num_epochs -- number of epochs
    print_cost -- True to print the cost every 1000 epochs

    Returns:
    params -- python dictionary containing your updated parameters
    """
    costs = []
    seed = 10

    # Initializing params
    params = initialize_parameters(layers_dims)

    if optimizer == 'bgd':
        print("Costs for Batch Gradient Descent")
        mini_batch_size = X.shape[1]  # one batch = the whole training set
    elif optimizer == 'sgd':
        print("Costs for Stochastic Gradient Descent")
        mini_batch_size = 1           # one example per update
    elif optimizer == 'mbgd':
        print("Costs for Mini-Batch Gradient Descent")
        mini_batch_size = 64

    # Optimization loop
    for i in range(num_epochs):
        seed = seed + 1
        minibatches = random_mini_batches(X, Y, mini_batch_size, seed)

        for minibatch in minibatches:
            # Select a minibatch
            (minibatch_X, minibatch_Y) = minibatch
            # Forward propagation
            a3, caches = forward_propagation(minibatch_X, params)
            # Compute cost
            cost = compute_cost(a3, minibatch_Y)
            # Backward propagation
            grads = backward_propagation(minibatch_X, minibatch_Y, caches)
            # Update parameters
            params = update_params_with_gd(params, grads, learning_rate)

        if print_cost and i % 1000 == 0:
            print("Cost after epoch %i : %f" % (i, cost))
        if print_cost and i % 100 == 0:
            costs.append(cost)

    plt.plot(costs)
    plt.ylabel("cost")
    plt.xlabel("epochs (per 100)")
    plt.title("Learning rate = " + str(learning_rate))
    plt.show()

    return params
```

```python
# Train the 3-layer model with Batch Gradient Descent
layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, optimizer="bgd")

# Predict
predictions = predict(train_X, train_Y, parameters)

# Plot decision boundary
plt.title("Model with Batch Gradient Descent optimization")
axes = plt.gca()
axes.set_xlim([-1.5, 2.5])
axes.set_ylim([-1, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
```

```python
# Train the 3-layer model with Stochastic Gradient Descent
layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, optimizer="sgd")

# Predict
predictions = predict(train_X, train_Y, parameters)

# Plot decision boundary
plt.title("Model with Stochastic Gradient Descent optimization")
axes = plt.gca()
axes.set_xlim([-1.5, 2.5])
axes.set_ylim([-1, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
```

```python
# Train the 3-layer model with Mini-batch Gradient Descent
layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, optimizer="mbgd")

# Predict
predictions = predict(train_X, train_Y, parameters)

# Plot decision boundary
plt.title("Model with Mini-batch Gradient Descent optimization")
axes = plt.gca()
axes.set_xlim([-1.5, 2.5])
axes.set_ylim([-1, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
```

**Download** the Full Code Snippet

In usual cases, the mini-batch gradient descent optimizer performs much better as compared to the batch or stochastic gradient descent, but in this case, as you can see the number of epochs (10000) is way greater than usual therefore the model is overfitting when using Stochastic GD. And that is why the accuracy of the model with Stochastic GD optimizer is 94.67%.

What’s happening here is, the system is looking at the same training instance/example several times and then finally moving to the next instance. This way the system knows each and every pattern of the instances, or we can say it has mugged it all up instead of just learning the concept. This is obviously not desired in the usual scenarios, so even if the accuracy is 94.67% it is not said to be a good model.

However, in the case of batch or mini-batch gradient descent, the accuracy is what it would usually give in almost all cases. So you can infer that the Mini-Batch GD optimizer, with an accuracy of 79.67%, wins over the Batch GD optimizer, which has an accuracy of 66% here.
