The post Extending Stochastic Models for Autonomous Vehicle Systems appeared first on StepUp Analytics.

*The word “stochastic” comes from the Greek stokhazesthai, meaning “to aim at” or “to guess at”. A stochastic process, also called a random process, is one whose outcomes are uncertain. Randomness and the probability distribution of each outcome are the basic elements of a stochastic process.*

*For example, suppose (according to a cancer survey) there are 50 possible metastatic locations for cancer to spread to in the human body. By applying stochastic modeling to this random input data and increasing the number of trials, we can draw conclusions about how the cancer spreads, which organ is likely to be affected next, which organs remain unaffected, and finally decide what kind of treatment should be given to the patient.*

*Stochastic models help us solve complex problems that are random by nature.*

*The following key elements bring intelligence in their own way:*

**Random walks** *are processes that take random inputs in an unpredictable way to produce corresponding outputs; they do not use any past data to predict the future. The term “random walk” was coined by the mathematician Karl Pearson (1857–1936) in 1905. Random walks are very useful in algorithmic trading, and they have also been applied to air-traffic collision detection and safety.*

**Brownian motion**, *or the Wiener process, was discovered by the biologist Robert Brown (1773–1858) while studying particle movements under a microscope, but it was described mathematically by Norbert Wiener (1894–1964). It is a continuous-time stochastic process and a special case of the Lévy process (a stochastic process with independent, stationary increments). It generates a pattern from randomness: if you increase the number of trials over the sample space, then at a certain point the randomness settles into a deterministic pattern, so we become able to predict the outcomes. It is widely used in finance, biochemistry, AI, etc.*
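As an illustration (not from the original post), a simple symmetric random walk is easy to simulate in R; the step directions are the random, memory-free inputs described above:

```r
# Simple symmetric random walk: each step is +1 or -1 with equal probability.
set.seed(42)                       # illustrative seed
n <- 1000
steps <- sample(c(-1, 1), n, replace = TRUE)
walk <- cumsum(steps)              # position after each step
head(walk)
```

Rescaling a walk of many small steps approximates Brownian motion, which links this element to the next one.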

**Poisson process** *is a stochastic process that counts the number of events occurring in a given interval of time. The inter-arrival times between events are independent of one another. The Poisson process is a continuous-time counting process, and it helps us predict the probability of a certain number of events happening in a fixed interval of time. It is named after Siméon Denis Poisson, who applied it in 1838. For example, it can help an autonomous vehicle model the number of occurrences of traffic signs on a road and, together with the other stochastic properties described below, predict what the next sign is likely to be.*
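A minimal sketch of the Poisson idea in R (the rate value is an assumption for illustration): the count of events in a fixed interval is Poisson distributed, and the inter-arrival times are independent exponentials.

```r
# Illustrative rate: on average 2 traffic signs per kilometer (assumed value).
lambda <- 2
# Probability of observing exactly 0..4 signs in one kilometer:
dpois(0:4, lambda)
# Inter-arrival distances of a Poisson process are exponential with rate lambda:
set.seed(1)
gaps <- rexp(1000, rate = lambda)
mean(gaps)   # close to 1/lambda
```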

**Markov chain** *is a stochastic process that moves from one state to another, where the new state depends only on the current state and not on the historical states. This is known as the memoryless property (Markov property) of a stochastic process. The change of states can be represented with a transition matrix. The Wiener process and the Poisson process are both continuous-time Markov processes. The Markov chain is named after the Russian mathematician Andrey Markov (1856–1922). If we apply Markov-chain intelligence to cancer prediction in the body, it can lead us to the following information:*

- *What is the next organ that will be affected by cancer?*
- *Which organs won’t be affected by cancer?*
- *What is the probability of the patient surviving the cancer?*
- *What kind of treatment should be given to the patient?*

*So it helps doctors give the correct treatment to patients.*
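As a toy sketch of the memoryless property (the states and probabilities here are invented for illustration, not taken from any cancer data), a two-state chain and its transition matrix can be written as:

```r
# Hypothetical two-state chain; all probabilities are made up for illustration.
P <- matrix(c(0.9, 0.1,
              0.3, 0.7),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("healthy", "affected"),
                            c("healthy", "affected")))
rowSums(P)        # each row of a transition matrix sums to 1
P2 <- P %*% P     # two-step transition probabilities
P2["healthy", "affected"]   # P(affected at t+2 | healthy at t) = 0.16
```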

*This is why Markov chains are needed for experimentation such as AVS, where the scenarios are randomized and behavior changes based on that randomness.*

*The Markov decision process (an extension of the Markov chain) is one example of reinforcement learning. Writing y = f(x) with reward z, we supply x and z to learn the function f that generates y; in turn, y helps refine the function f.*

*Its decision-making in every situation depends only on the current state.*

*I will explain the Markov decision process with the following example. Imagine that the following table is a grid of places in a city, and an AVS needs to reach the Goal state from the Start state by executing actions. If you hit the boundary of the grid, you stay where you are: at (1,1), if you try to go UP you stay in place, and if you try to go LEFT you also stay in place; but if you try to go RIGHT, you move to the next place.*

*Actions: UP, DOWN, LEFT, RIGHT*

*Question: What is the shortest sequence getting from Start to Goal?*

- *Ans 1: UP, UP, RIGHT, RIGHT, RIGHT*
- *Ans 2: RIGHT, RIGHT, UP, UP, RIGHT*

*Both answers are correct solutions to this problem, so I take the first one.*
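Since the original grid table is shown as an image, here is a hypothetical sketch of the boundary rule, assuming a 4-column by 3-row grid (consistent with the 12 states mentioned later); a state is a (column, row) pair:

```r
# Hypothetical 4x3 grid. Moving off the boundary leaves the agent in place,
# as described in the text for state (1,1).
step <- function(state, action) {
  moves <- list(UP = c(0, 1), DOWN = c(0, -1),
                LEFT = c(-1, 0), RIGHT = c(1, 0))
  nxt <- state + moves[[action]]
  if (nxt[1] < 1 || nxt[1] > 4 || nxt[2] < 1 || nxt[2] > 3) return(state)
  nxt
}
step(c(1, 1), "LEFT")    # blocked by the boundary: stays at (1,1)
step(c(1, 1), "RIGHT")   # moves to (2,1)
```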

*Markov property: in order to know the near future (say, at time t+1), only the present information at time t matters. Given a sequence X1, X2, …, Xt:*

*The first order of Markov says*

P(Xt | Xt−1, …, X1) = P(Xt | Xt−1)

*That is, Xt depends only on Xt−1; therefore, Xt+1 will depend only on Xt.*

*The second order of Markov says*

P(Xt | Xt−1, …, X1) = P(Xt | Xt−1, Xt−2)

*That is, Xt depends only on Xt−1 and Xt−2.*

- *Only the present matters*
- *Stationary (the rules do not change)*

*Markov Decision Process:*

- STATES: S
- MODEL: T(s, a, s’) ~ P(s’|s, a)
- ACTIONS: A(s), A
- REWARDS: R(s), R(s, a), R(s, a, s’)
- POLICY: π(s) –> a; π* (optimal policy)

**STATES: S** *(from the above example there are 12 states available). States are feature representations of the data collected from the environment. They can be either discrete or continuous.*

**MODEL: T(s, a, s’) ~ P(s’|s, a)** *The model, or transition model, describes the rules of this example for reaching the goal state. It is a function of three variables, the current state (s), the action (a), and the new state (s’), and it gives the probability of landing in the new state (s’) given that the agent takes action (a) in state (s). It tells you what will happen if you do something in a particular place.*

*In a deterministic environment, any landing state other than the determined one has zero probability.*

*For example:*

*Deterministic environment: if you take a certain action, go Up, you will certainly perform that action with probability 1.*

*Stochastic environment: if you take the same action, go Up, there is a certain probability, say 0.8, of actually performing the given action, and a 0.1 probability each of performing an action perpendicular to it (Left or Right). Here, for state s and the Up action, the transition model gives T(s, Up, s’) = P(s’|s, Up) = 0.8.*
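The 0.8/0.1/0.1 example above can be checked empirically with a short simulation (a sketch, using only the probabilities stated in the text):

```r
# Intended action "Up" succeeds with probability 0.8; with probability 0.1 each
# the agent slips perpendicular to it (Left or Right), as in the example above.
set.seed(7)
outcomes <- sample(c("Up", "Left", "Right"), size = 10000,
                   replace = TRUE, prob = c(0.8, 0.1, 0.1))
prop.table(table(outcomes))   # empirical frequencies near 0.8, 0.1, 0.1
```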

*This follows the first-order Markov property. So we can say that an autonomous vehicle also operates in a stochastic environment: the AVS is composed of decision-making processes in different states defined by position, speed, and other attributes of the AVS. Actions performed by each decision-making process change those states and so cause a change in the AVS.*

**ACTIONS: A(s), A** *Actions can be performed at a particular state, e.g. A = {Up, Down, Left, Right}. They can be either discrete or continuous.*

*An action can be treated as a function of the state, a = A(s): depending on the state, this function decides which actions are possible.*

**REWARDS: R(s), R(s,a), R(s,a,s’)** *The reward quantifies the usefulness of entering a state. It is a scalar value, and there are three equivalent forms of representing it: R(s), R(s, a) and R(s, a, s’). Domain knowledge plays an important role in assigning rewards to the different states, as even minor changes in the rewards matter for finding the optimal solution to an MDP problem.*

*Example for R(s): the Goal state gets a reward of 1 and every non-goal state gets a reward of −1, i.e. R(Goal) = 1 and R(Not Goal) = −1.*

**POLICY: π(s) –> a** *The policy is a function that takes the state as input and outputs the action to be taken; it is the command the agent has to obey. It is a guide telling the agent which action to take in a given state.* The optimal policy, π*, is the one that maximizes the expected reward.
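A policy can be sketched as a simple lookup from states to actions; the state labels below are hypothetical, following the grid example and its first answer (UP, UP, RIGHT, RIGHT, RIGHT):

```r
# Minimal policy sketch: a lookup table mapping (hypothetical) states to actions.
policy <- c("(1,1)" = "UP", "(1,2)" = "UP", "(1,3)" = "RIGHT",
            "(2,3)" = "RIGHT", "(3,3)" = "RIGHT")
pi_of <- function(s) unname(policy[s])   # the function form of pi(s) -> a
pi_of("(1,1)")   # "UP"
```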

*From these MDP properties, you get an idea of how to implement an MDP (reinforcement learning) for an autonomous vehicle. For example, suppose an autonomous vehicle operates in an idealized grid-like city, with roads going North–South and East–West, and the traffic signs are the states (S). The model, or transition model (T), tells the AV which signs will appear next on the road, helping it make decisions.*

*If the AV detects a stop sign, it takes an action (A): release the throttle, decrease speed, and stop the vehicle. The AV is rewarded (R) for each successful completion of the trip. From the rewards and penalties, the AV should learn an optimal policy (π*) for driving on city roads, obeying traffic rules correctly, and trying to reach the destination within a goal time.*

*What needs to be done to achieve AVS Level 4?*

- *Level 0: No Automation*
- *Level 1: Driver Assistance Required*
- *Level 2: Partial Automation Options Available*
- *Level 3: Conditional Automation*
- *Level 4: High Automation*
- *Level 5: Full Automation*

*Currently we are at Level 3. Audi claims that the new A8 is the first production car to achieve Level 3 autonomy (not Level 4, as Motor Trend claims). The Audi AI traffic jam pilot can take over the tedious job of creeping through highway traffic jams at speeds below 37 mph.*

*Compared with Level 3, Level 4 is highly automated. In Level 3, if the driver goes far above 37 mph, the car’s autonomy is ruled out, and in that kind of situation the driver needs to take responsibility for the car. At Level 4, autonomous vehicles will be able to handle most “dynamic driving tasks”, to use SAE International’s terminology.*

*That is, a Level 4 car can handle most normal driving tasks on its own, but we still need driver intervention from time to time, for example during poor weather conditions. So the following things need to be done to achieve AVS Level 4.*

*They are:*

- *A Level 3 autonomous car should satisfy SAE International’s terminology for moving to Level 4.*
- *An AVS at Level 4 is capable of performing all driving functions under certain conditions, so we need to introduce more real-world driving problems, with conditions, for the AV to solve in order to move from Level 3 to Level 4.*
- *Upgrade a Level 3 AVS into Level 4 by introducing reinforcement learning: if an AVS already knows which traffic signs will appear next, it is aware of them and can make decisions accordingly, so the driver and other people in the car are treated as cargo!*
- *In a Level 3 AVS, in specific situations and environments the AV may face no interruption on its way (such as a highway), and it then allows the human driver to do whatever they want. We need to generalize this at Level 4 so that the AV can drive itself independently in most environments, with some exceptions for weather or unusual environments; a human may still need to take over at times. Introducing RADAR, LIDAR, GPS, digital cameras, and processors upgrades a Level 3 AV into Level 4.*

*Demonstration of how a Markov chain is useful in achieving AVS Level 4, considering one of the scenarios of road-sign detection and action.*

*An autonomous vehicle detects the traffic signs on the road using its camera and classifies them with the help of the kNN algorithm. Here it uses a Markov chain to predict what the next traffic sign will be, given the current traffic sign. The Markov transition matrix below is based on the Trichy-to-Madurai national highway in Tamil Nadu, India.*

*Taking high powers of the transition matrix gives the limiting distribution.*
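The idea of the limiting distribution can be sketched on a small invented two-state chain (the same computation applies to the 10×10 traffic-sign matrix defined in the code below): raise the transition matrix to a high power, and every row converges to the stationary distribution.

```r
# Illustrative two-state chain (values assumed); P^100 by repeated multiplication.
P <- matrix(c(0.9, 0.1,
              0.3, 0.7), nrow = 2, byrow = TRUE)
Pn <- diag(2)
for (i in 1:100) Pn <- Pn %*% P
Pn[1, ]   # both rows converge to the stationary distribution (0.75, 0.25)
```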

*The following R code simulates an autonomous vehicle over 100 kilometers. Over that distance, the AVS detects traffic symbols and estimates the probability of each sign appearing next after the current one. We then increase the sample size to get a more accurate probability of the next traffic sign, which the AVS uses to make its decision. Compare the 100-step simulated result with the million-step simulation.*

markov <- function(init, mat, n, labels) {
  if (missing(labels)) labels <- 1:length(init)
  simlist <- numeric(n + 1)
  states <- 1:length(init)
  simlist[1] <- sample(states, 1, prob = init)
  for (i in 2:(n + 1)) {
    simlist[i] <- sample(states, 1, prob = mat[simlist[i - 1], ])
  }
  labels[simlist]
}

P <- matrix(c(0, 0.8, 0.17, 0.28, 0.28, 0, 0.02, 0.01, 0.08, 0.08,
              0.08, 0.02, 0.1, 0.22, 0.18, 0.02, 0.3, 0, 0.08, 0,
              0.02, 0.2, 0, 0.3, 0.38, 0, 0.08, 0, 0.02, 0,
              0.22, 0.4, 0.02, 0, 0.08, 0.09, 0, 0, 0.11, 0.08,
              0.28, 0, 0.1, 0.1, 0.01, 0.09, 0.1, 0.12, 0.08, 0.12,
              0.4, 0, 0.2, 0, 0, 0.04, 0, 0.3, 0, 0.06,
              0, 0.3, 0.01, 0, 0.07, 0.3, 0, 0.2, 0.1, 0.02,
              0, 0, 0.2, 0, 0, 0.45, 0.25, 0, 0, 0.1,
              0, 0, 0.1, 0.05, 0, 0, 0.25, 0, 0.2, 0.4,
              0, 0, 0.1, 0.05, 0, 0.01, 0, 0.37, 0.33, 0.14), nrow = 10, byrow = TRUE)

lab <- c("Stop", "Speed_Limit", "Speed_Break", "Pedestrian", "School_Ahead",
         "Man_At_Work", "Narrow_Road", "Road_Wideness", "Hospital", "Petrol")

rownames(P) <- lab
colnames(P) <- lab

init <- rep(1/10, 10)  # initial distribution

states <- c("St", "Sl", "Sb", "Pt", "Sch", "M", "Nr", "Rw", "H", "P")

# simulate the chain for 100 steps
simlist <- markov(init, P, 100, states)
simlist
table(simlist) / 100

# compare with a million-step simulation
steps <- 1000000
simlist <- markov(init, P, steps, states)
table(simlist) / steps

*Output:*

simlist <- markov(init,P,100,states)

simlist

[1] “M” “St” “P” “Rw” “Sb” “H” “P” “H” “Nr” “M” “St” “Sb” “Sch” “St” “Sch” “St” “Sl” “Pt” “H”

[20] “Nr” “Sl” “Sch” “St” “Sl” “Sb” “Sch” “H” “P” “Rw” “M” “St” “Nr” “Sl” “M” “Rw” “M” “P” “P”

[39] “Rw” “M” “Rw” “M” “Rw” “M” “Sb” “Sl” “H” “H” “P” “Rw” “M” “Rw” “Nr” “M” “St” “Sb” “Sch”

[58] “M” “Sb” “Pt” “H” “P” “Rw” “Nr” “Rw” “Nr” “M” “St” “Sl” “Nr” “Sl” “Sch” “Nr” “Sl” “Pt” “St”

[77] “Sch” “Nr” “M” “St” “Sl” “Sb” “Sch” “St” “Sch” “P” “P” “H” “P” “H” “Sb” “Nr” “Sl” “Pt” “Sl”

[96] “Nr” “Sl” “Nr” “Rw” “P” “Rw”

table(simlist)/100

simlist

H M Nr P Pt Rw Sb Sch Sl St

0.09 0.13 0.12 0.11 0.04 0.12 0.08 0.09 0.12 0.11

steps <- 1000000

simlist <- markov(init,P,steps,states)

table(simlist)/steps

simlist

*This methodology helps an autonomous vehicle make decisions early by predicting future traffic signs on the road. It is easier to make a decision based on already-identified information.*

*How can the AVS also recognize road-sign images to further achieve Level 4?*

*An autonomous vehicle uses OpenCV and cameras to recognize road signs with the help of machine learning algorithms. Here, we use the k-Nearest Neighbor (kNN) algorithm in R to predict the traffic signs. The following steps explain how kNN helps an AV recognize road-sign images.*

**Recognizing a road sign with kNN**

After several trips with a human behind the wheel, it is time for the Autonomous Vehicle to attempt the test course alone. As it begins to drive away, its camera captures the following image:

Can you apply a kNN classifier to help the car recognize this sign?

**R Code:**

# Load the 'class' package
library(class)
library(tidyverse)

traffic_signs <- read_csv("E:/internship in stepup analytics/extened stochastic model for autonomous vehicle system/traffic_sign3.csv")

signtype <- traffic_signs$sign_type
nextsign <- traffic_signs[c(343), c(3:50)]

# Classify the next sign observed
knn(train = traffic_signs[-c(1:2)], test = nextsign, cl = signtype)

**Output:**


[1] stop

Levels: hospital man_at_work narrow_road pedestrian petrol road_wideness school speed speed_break stop

*We’ve trained our first nearest-neighbor classifier! The AV successfully identified the sign and stopped safely at the intersection. So how did the knn() function correctly classify the stop sign? The answer is that the sign was in some way similar to another stop sign. kNN isn’t really learning anything; it simply looks for the most similar example.*

*Exploring the traffic sign dataset*

*To better understand how the knn() function was able to classify the stop sign, it may help to examine the training dataset that was used. Each previously observed street sign was divided into a 4×4 grid, and the red, green, and blue levels of the 16 center pixels were recorded, as illustrated here.*

*The result is a dataset that records the sign_type as well as 16 x 3 = 48 color properties of each sign.*

*R Code:*

# Examine the structure of the signs dataset
str(traffic_signs)

# Count the number of signs of each type
table(traffic_signs$sign_type)

# Check r10's average red level by sign type
aggregate(r10 ~ sign_type, data = traffic_signs, mean)

*Output:*

aggregate(r10 ~ sign_type, data = traffic_signs, mean)

*As you might have expected, stop signs tend to have a
higher average red value. This is how kNN identifies similar signs.*

*Classifying a collection of road signs*

*Now that the autonomous vehicle has successfully stopped on its own, we feel confident in allowing the car to continue the test course. The test course includes 59 additional road signs divided into ten types:*

*At the conclusion of
the trial, you are asked to measure the car’s overall performance at
recognizing these signs.*

*R code:*

# Use kNN to identify the test road signs
signtypes2 <- traffic_signs[c(traffic_signs$sample == "train"), c(2)]
signtypes <- t(signtypes2)
testsigns <- traffic_signs[c(traffic_signs$sample == "test"), c(3:50)]
trainsigns <- traffic_signs[c(traffic_signs$sample == "train"), c(3:50)]
signspred <- knn(train = trainsigns, test = testsigns, cl = signtypes)

# Create a confusion matrix of the actual versus predicted values
testsigns2 <- traffic_signs[c(traffic_signs$sample == "test"), c(2:50)]
signsactual <- testsigns2$sign_type
table(signspred, signsactual)

# Compute the accuracy
mean(signspred == signsactual)

*Output:*

table(signspred,signsactual)

mean(signspred == signsactual)

[1] 0.9464286

*That Autonomous Vehicle is really coming along! The confusion matrix lets you look for patterns in the classifier’s errors.*

*There is a complex relationship between k and classification accuracy, and bigger is not always better. With smaller neighborhoods, kNN can identify subtler patterns in the data. So a valid reason for keeping k as small as possible (but no smaller) is that a smaller k can pick up these subtler patterns.*

*Testing other ‘k’ values*

*By default, the knn() function in the class package uses only the single nearest neighbor. Setting the k parameter allows the algorithm to consider additional nearby neighbors, which enlarges the collection of neighbors that vote on the predicted class.*

*Compare k values of 1, 7, and 15 to examine the impact on traffic sign classification accuracy.*

*R Code:*

# Compute the accuracy of the baseline model (default k = 1)

k_1 <- knn(train = trainsigns, test = testsigns, cl = signtypes)

mean(signsactual == k_1)

*# Modify the above to set k = 7*

k_7 <- knn(train = trainsigns, test = testsigns, cl = signtypes, k = 7)

mean(signsactual == k_7)

*# Set k = 15 and compare to the above*

k_15 <- knn(train = trainsigns, test = testsigns, cl = signtypes, k = 15)

mean(signsactual == k_15)

*Output:*

k_1 <- knn(train = trainsigns, test = testsigns, cl = signtypes)

mean(signsactual == k_1)

[1] 0.9464286

k_7 <- knn(train = trainsigns, test = testsigns, cl = signtypes, k = 7)

mean(signsactual == k_7)

[1] 0.9375

k_15 <- knn(train = trainsigns, test = testsigns, cl = signtypes, k = 15)

mean(signsactual == k_15)

[1] 0.7410714

*Which value of k gave the highest accuracy? k = 1, with an accuracy of 0.9464286.*

*Seeing how the neighbors voted*

*When multiple nearest neighbors hold a vote, it can sometimes be useful to examine whether the voters were unanimous or widely split.*

*For example, knowing
more about the voters’ confidence in the classification could allow an
autonomous vehicle to use caution in the case there is any chance at all that a stop sign is ahead.*

*Here, we will learn how to obtain the voting results from the knn() function.*

*R Code:*

# Use the prob parameter to get the proportion of votes for the winning class

signpred <- knn(train = trainsigns, test = testsigns, cl = signtypes, k = 7, prob = TRUE)

# Get the "prob" attribute from the predicted classes
signprob <- attr(signpred, "prob")

# Examine the first several predictions
head(signpred)

# Examine the proportion of votes for the winning class
head(signprob)

*Output:*

head(signpred)

[1] stop stop stop stop stop stop

Levels: hospital man_at_work narrow_road pedestrian petrol road_wideness school speed speed_break stop

head(signprob)

[1] 1 1 1 1 1 1

*Now you can get an
idea of how certain our kNN learner is about its classifications.*

*Before applying kNN to a classification task, it is common practice to rescale the data using a technique like min-max normalization. The purpose of this step is to ensure that all features contribute equally to the distance calculation: rescaling reduces the influence of extreme values on kNN’s distance function.*
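A minimal sketch of min-max normalization in R (the helper name is my own, not from the post):

```r
# Min-max normalization: rescale a feature to [0, 1] so no single feature
# dominates kNN's distance calculation.
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
normalize(c(10, 20, 30, 50))   # 0.00 0.25 0.50 1.00
```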

*Conclusion*

*The Gartner Hype Cycle 2018 for autonomous vehicles shows that AV Level 4 is still under development and will take more than 10 years. Currently, the Audi A8 introduced AV Level 3 in 2018. In this article I have suggested one way of applying a stochastic model to an autonomous vehicle; it can be applied in various ways depending on the problem. The idea presented here is to predict the next traffic sign to appear on the road, and in future we will develop the AV to perform decision-making for this kind of situation.*

*References*

1. Gartner Inc.
2. Brett Lantz, Data Scientist at the University of Michigan (Datacamp.com)
3. Introduction to Stochastic Processes with R, Robert P. Dobrow
4. https://en.wikipedia.org/wiki/List_of_stochastic_processes_topics
5. https://www.datasciencesociety.net/stochastic-processes-and-applications


The post Steps Of K-Means Clustering In R appeared first on StepUp Analytics.

Clustering can be used to improve predictive accuracy by segmenting databases into more homogeneous groups. Then the data of each group can be explored, analyzed, and modeled.

Clustering is used to classify items or cases into relatively homogeneous groups called clusters; objects in one cluster tend to be similar to each other and dissimilar to objects in other clusters.

K-Means Clustering groups items or observations into a collection of K clusters and the number of clusters, K, may either be specified in advance or determined as a part of the clustering procedure. K-Means clustering has been included in the Machine Learning section of CS2 (Risk Modelling and Survival Analysis). Let’s have a look at the procedure and how it’s applied in R.

**1.** Partition the items into K initial clusters, where K is any initial estimate of the number of clusters which can be determined according to the business requirements. Alternatively, it can be determined by using the elbow method (which is a widely used technique), which will be discussed further in this article.

**2. **Euclidean distance with either standardized or unstandardized observations is calculated. Assign an item to the cluster whose centroid (mean) is nearest. Recalculate the centroid for the cluster receiving the new item and for the cluster losing the item.

**3.** Repeat Step 2 until no more reassignments take place.
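The three steps above can be sketched by hand on toy one-dimensional data (everything here is invented for illustration; in practice you would call kmeans( ) as shown later):

```r
# Toy k-means: assign points to the nearest centroid, then recompute centroids,
# repeating until assignments stabilize (a fixed number of loops suffices here).
x <- c(1, 2, 1.5, 8, 9, 8.5)   # two obvious groups
centroids <- c(1, 9)           # Step 1: initial centroids for K = 2
for (iter in 1:10) {
  # Step 2: assign each point to the cluster with the nearest centroid
  cluster <- apply(abs(outer(x, centroids, "-")), 1, which.min)
  # Step 2 (contd.): recalculate the centroids as cluster means
  centroids <- tapply(x, cluster, mean)
}                              # Step 3: repeat until no more reassignments
centroids   # about 1.5 and 8.5
```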

Let’s have a look at K-Means Clustering on the Wholesale Customer dataset (ref. UCI Machine Learning Repository) using R.

**Data description:**

- FRESH: annual spending on fresh products (Continuous)
- MILK: annual spending on milk products (Continuous)
- GROCERY: annual spending on grocery products (Continuous)
- FROZEN: annual spending on frozen products (Continuous)
- DETERGENTS_PAPER: annual spending on detergents and paper products (Continuous)
- DELICATESSEN: annual spending on delicatessen products (Continuous)

*The kmeans( ) function can be used from the ‘stats’ package in R. Note that ‘stats’ ships with base R and is loaded by default, so no installation is needed; you can attach it explicitly with:*

library(stats)

**Step 1:** Read the data using import dataset or read.csv( ), and assign it to data1.

**Step 2:** Get the descriptives of the data using summary( ) in R.

**Step 3:** Here, we observe that the data has a large range of values for some variables as compared to others. The variables with a larger range of values tend to dominate, so we standardize all the variables so that each uses the same range. We rescale the variables so that they have a mean of 0 and a standard deviation of 1.

A large z-score implies that the observation is far away from the mean in terms of standard deviations; e.g. a z-score of 3 means the observation is 3 standard deviations away from the mean.

We rescale the data using scale( ) in R.

data1 <- scale(data1)
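To see what scale( ) is doing, the z-score can be computed by hand and compared (a sketch with made-up numbers):

```r
# scale() standardizes each column: z = (x - mean(x)) / sd(x)
x <- c(10, 20, 30, 40, 50)
z <- (x - mean(x)) / sd(x)
all.equal(z, as.numeric(scale(x)))   # TRUE
mean(z)   # 0 (up to floating-point error)
sd(z)     # 1
```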

**Step 4:** Now we need to find the optimal number of clusters, K. The elbow method analyses how the homogeneity or heterogeneity within the clusters changes for various values of K: homogeneity within clusters usually increases as additional clusters are added, and heterogeneity decreases. The goal is to find the value of K beyond which there is negligible gain in information. If one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, indicated by an elbow in the curve; hence the name Elbow method. We do this in R using a function that gives the within-cluster sum of squares for different numbers of clusters.

withss <- sapply(1:10, function(k) {
  kmeans(data1, k, nstart = 50, iter.max = 15)$tot.withinss
})

Plotting Within Sum of squares v/s Number of clusters, using plot( ) in R.

plot(1:10, withss, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters", ylab = "Within Sum of squares")
axis(1, at = 1:10, labels = seq(1, 10, 1))

In Figure 1, we observe that there are elbows at 2 and 5 clusters. On analysing values of K ranging from 2 to 5, we find that the optimal number of clusters is 3 for the customer segmentation at hand. The 5-cluster solution gives a more detailed customer segmentation, but at this stage we’ll look at the 3-cluster solution. Thus, work out the cluster analysis for 3 clusters using kmeans( ) in R:

clust_output <- kmeans(data1, centers = 3)

**Step 5: ** Analyzing the cluster analysis output,

There are 3 clusters of sizes 49, 347 and 44 respectively. The cluster centers give us insights about the cluster description.

**Cluster 1** has the highest spenders on Fresh, Frozen, and Delicatessen products. This cluster consists of consumers who spend more on fine foods and are high spenders overall.

**Cluster 2** has low spenders across all products.

**Cluster 3** has the highest spenders on Milk, Grocery, and Detergents_Paper. This cluster consists of consumers who spend mainly on domestic and household products.


The post H20 Package: Classification Using Logistic Regression appeared first on StepUp Analytics.


Here logistic regression comes from the underlying assumption of the GLMs, which I will discuss in the next section. I will be talking about two ways of carrying out logistic regression in R: the standard method using the **glm** function from the base package, and the h2o.glm function from the h2o package in R. We will also see how the accuracy improves from the first model to the second model.

**What are
Generalized linear models and how do they differ from the classical linear
models?**

We already know that the distribution of the error term in linear models is assumed to be normal. But in the cases where we have binary classes for the response variable, we assume the error term follows not the normal distribution but the logistic distribution, whose cumulative distribution function is F(x) = 1 / (1 + e^(−x)).

Hence the term logistic regression. This cdf can be represented graphically as an S-shaped curve, also known as the sigmoid function. The output of this function always lies between 0 and 1.
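The sigmoid can be written directly in R (a small sketch, not part of the original post):

```r
# Logistic (sigmoid) function: maps any real number into (0, 1).
sigmoid <- function(x) 1 / (1 + exp(-x))
sigmoid(c(-5, 0, 5))   # approximately 0.0067, 0.5, 0.9933
```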

For the analysis I’ll be using an example dataset and the following steps will be followed:

- Reading the data
- Splitting the data into training and testing sets
- Applying glm on the training set
- Prediction using the test data
- Calculating the accuracy

The dataset considered here contains 21 variables and 3168 observations, where the label variable indicates whether the voice of the individual considered is male or female. Before we proceed with the analysis, some pre-processing of the data is required. We will first subset the data, considering only the variables important for our analysis, and then convert the label variable into a factor with levels 1 and 2 representing female and male respectively.

The data looks somewhat like this:

data <- read.csv(“voice.csv”)

head(data)

data$label <- factor(data$label)

str(data)

names(data)

Along with label we have a set of 20 other variables that are the descriptive statistics upon which our response variable depends.

Now, we attempt to partition the data into training and testing datasets. For this, we use the createDataPartition function from the caret package:

library(caret)

set.seed(99)

Train=createDataPartition(data$label,p=0.75, list=F)

training <- data[Train,]

testing <- data[-Train,]

Use set.seed(99) for replicability purposes. In the next step, we fit the logistic regression model using the glm function:

fit <- glm(label ~ Q25 + Q75 + sp.ent + sfm + mode + meanfun + minfun, data = training, family = binomial(link = "logit"))

To check what model has been formed and in order to interpret the results, we use the summary function to extract all the information possible.

summary(fit)

Results:

The call shows that we have executed the code correctly, with the response variable placed before the ~ and the predictor variables after it. From the table, we can see that all of the independent variables used in the model are significant at the 10% level of significance, although we could remove the mode variable if we consider a 5% level of significance.

Apart from the table, it is necessary to note that the AIC value obtained from this model is 440. We will try to reduce this using a different method in the next section. But before that, we now try to calculate how well this model performs on testing or unseen data.

p <- predict(fit, newdata = testing, type = "response")

head(p) #gives the first 6 observations

The serial numbers simply indicate which of the original observations were placed in the testing dataset (the first ten, then observation 14, and so on). To check how accurately the above model classifies the gender label on the test data, we set a threshold of 0.5: observations with a predicted probability greater than 0.5 are classified as male, otherwise female.

To measure the accuracy and to check for the misclassification, we form a confusion matrix.

pred1 <- ifelse(p>0.5,2,1)

tab <- table(pred1,testing$label)

tab

Thus, from the table, we can see that 379 observations were correctly classified as female and 386 as male. To calculate the accuracy, we use accuracy = correctly classified / total = (379 + 386) / 792 ≈ 0.966, which gives a great accuracy of 96.6%.
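The same calculation can be wrapped in a small helper (the off-diagonal counts below are assumed for illustration; only the 379 and 386 correct counts come from the text):

```r
# Accuracy = sum of the confusion matrix diagonal / total observations.
accuracy <- function(tab) sum(diag(tab)) / sum(tab)
# Illustrative confusion matrix; off-diagonal counts are assumed.
tab <- matrix(c(379, 14,
                13, 386), nrow = 2, byrow = TRUE,
              dimnames = list(pred = c("female", "male"),
                              actual = c("female", "male")))
accuracy(tab)   # (379 + 386) / 792, about 0.966
```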

Before we proceed to the analysis it is necessary to understand what this package is and what does it do?

H2O is a leading open-source platform that makes it easy for financial services, insurance, and healthcare companies to deploy AI and deep learning to solve complex problems. To make it easier for non-engineers to create complete analytic workflows, H2O’s platform includes interfaces for R, Python, and many more tools.

The steps that will be followed here are quite different from the previous case:

- Initialize the h2o package
- Read in the data
- Pre-process the data if required
- Convert the data into an H2O-readable format
- Split the data into training and testing sets
- Fit the model and check its accuracy on the test data

Following the above steps:

library(h2o)

h2o.init(max_mem_size = "2G", nthreads = 2, ip = "localhost", port = 54321)

data <- read.csv("voice.csv")

data1 <- data[,c(4,5,9,10,11,13,14,21)]

d.hex <- as.h2o(data1,destination_frame = “d.hex”)

Now we have the data that can be read in by the h2o package. Converting the data might take a few seconds, depending upon the configuration of your laptop. Wait until the conversion shows 100%. To check the first few values you can use the head function.

For splitting the dataset, instead of the caret package we use the inbuilt functions from the h2o package only.

head(d.hex)

Note that base R’s set.seed() does not control H2O’s random number generator, so we pass the seed to h2o.splitFrame directly:

split <- h2o.splitFrame(data = d.hex, ratios = 0.75, seed = 99)

train <- split[[1]]

test <- split[[2]]

After running these functions, we can now carry out the glm function from the h2o package only on the training data and then check it on the test data for how accurately it classifies categories. Again running these codes might take a few seconds.

fit3 <- h2o.glm(x = 1:7, y = 8, training_frame = train, family = "binomial", link = "logit")

h2o.performance(fit3,test)

The performance function will give a list of performance measures like RMSE, LogLoss, AUC, etc. But in this blog, I intend to concentrate only on the AIC and the confusion matrix.

We can clearly see that the AIC value is lower than that of the first model, which is an indication of a better and more robust model.

We extract the confusion matrix to measure accuracy and misclassification. The beauty of this package is that it contains all the necessary functions required for the analysis, so that you don’t have to refer to different packages for different functions.

Clearly, 390 females and 373 males are correctly classified. The accuracy is given by (390+373)/(390+373+13+8) = 97%, which is slightly greater than the previous model, although there is no significant difference in accuracy between the two models.

In order to check the predictions made for each observation in the test data, and how strong the probability is for each prediction, we use the following function:

So the prediction made for the first observation is a male with probability **0.99966**, which is quite high, and so on.

And this is how you can use two different methods of carrying out logistic regression on the same dataset.

**Download** the data used in this blog. Read the latest articles on **Machine Learning**.

The post H20 Package: Classification Using Logistic Regression appeared first on StepUp Analytics.


Classification is the process of predicting the class of given data points. Classes are sometimes called targets, labels or categories. Classification predictive modeling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y).

Unlike regression where you predict a continuous number, you use classification to predict a category. There is a wide variety of classification applications from medicine to marketing.

Let’s say you own a shop and you want to figure out if one of your customers is going to come visit your shop again or not. The answer to that question can only be a ‘Yes’ or ‘No’.

These kinds of problems in Machine Learning are known as classification problems.

Classification problems normally have a categorical output like ‘yes’ or ‘no’, ‘1’ or ‘0’, ‘True’ or ‘False’. Let’s go through another example:

Say you want to check if on a particular day, a game of cricket is possible or not.

In this case the weather conditions are the dependent factors and based on them, the outcome can either be ‘Play’ or ‘Don’t Play’.

Just like Classification, there are two other types of problems in Machine Learning, and they are:

**Regression and Clustering**

In the image above, we have the list for all the different algorithms or solutions used for each of the problems.

Five commonly used algorithms for solving classification problems are:

- Decision Tree
- Naive Bayes
- Random Forest
- Logistic Regression
- KNN

Based on the kind of problem statement and the data in hand, we decide the kind of classification algorithm to be used.

Decision tree builds classification or regression models in the form of a tree structure. It breaks down a data set into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches, and a leaf node represents a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.

Naive Bayes is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple, and that is why it is known as ‘Naive’.

Naive Bayes model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.

Bayes theorem provides a way of calculating posterior probability P(c|x) from P(c), P(x) and P(x|c). Look at the equation below:

Above,

- *P*(*c|x*) is the posterior probability of *class* (c, *target*) given *predictor* (x, *attributes*).
- *P*(*c*) is the prior probability of *class*.
- *P*(*x|c*) is the likelihood, i.e. the probability of *predictor* given *class*.
- *P*(*x*) is the prior probability of *predictor*.
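To make the calculation concrete, here is a minimal Python sketch using the apple example above. The probabilities (0.8, 0.9, 0.7, and the priors) are purely hypothetical, and P(x) is handled by normalizing over the two classes:

```python
def unnormalized_score(prior, likelihoods):
    """Multiply the class prior P(c) by the independent feature likelihoods P(x_i|c)."""
    score = prior
    for p in likelihoods:
        score *= p
    return score

# Hypothetical numbers: P(red|apple)=0.8, P(round|apple)=0.9, P(~3in|apple)=0.7
apple = unnormalized_score(0.3, [0.8, 0.9, 0.7])
other = unnormalized_score(0.7, [0.2, 0.4, 0.3])

# Posterior P(apple | red, round, ~3in): normalizing plays the role of P(x)
posterior_apple = apple / (apple + other)
print(round(posterior_apple, 2))  # 0.9
```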

Random Forest is a supervised learning algorithm. As its name suggests, it creates a “forest” and makes it somewhat random. The forest it builds is an ensemble of Decision Trees, most of the time trained with the “bagging” method. The general idea of bagging is that a combination of learning models improves the overall result.

**To say it in simple words: Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.**

One big advantage of random forest is that it can be used for both classification and regression problems, which form the majority of current machine learning systems. I will talk about random forest in classification, since classification is sometimes considered the building block of machine learning. Below you can see what a random forest with two trees would look like:

Imagine a guy named Andrew who wants to decide which places he should visit during a one-year vacation trip. He asks people who know him for advice. First he goes to a friend, who asks Andrew where he has traveled in the past and whether he liked it or not. Based on the answers, the friend gives Andrew some advice.

This is a typical decision tree algorithm approach: Andrew’s friend created rules to guide his recommendation using Andrew’s answers.

Afterward, Andrew asks more and more of his friends for advice, and they again ask him different questions from which they derive their recommendations. He then chooses the places that were recommended to him the most, which is the typical Random Forest algorithm approach.

The k-nearest-neighbours algorithm is a classification algorithm, and it is supervised: it takes a bunch of labelled points and uses them to learn how to label other points. To label a new point, it looks at the labelled points closest to that new point (those are its nearest neighbours), and has those neighbours vote, so whichever label the most of the neighbours have is the label for the new point (the “k” is the number of neighbours it checks).

*k*-Nearest Neighbour is a lazy learning algorithm which stores all instances corresponding to the training data points in n-dimensional space. When unknown discrete data is received, it analyzes the closest k stored instances (the nearest neighbors) and returns the most common class as the prediction; for real-valued data, it returns the mean of the k nearest neighbors.

In the distance-weighted nearest neighbor algorithm, the contribution of each of the k neighbors is weighted according to its distance from the query point, giving greater weight to the closest neighbors.

Usually, KNN is robust to noisy data since it is averaging the k-nearest neighbors.
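The voting logic described above can be sketched in a few lines of Python. The points and labels below are invented toy data; in practice one would use a library implementation rather than this sketch:

```python
import math
from collections import defaultdict

def knn_predict(points, labels, query, k=3, weighted=False):
    """Classify `query` by majority (optionally distance-weighted) vote
    of its k nearest labelled neighbours."""
    nearest = sorted(
        (math.dist(p, query), lab) for p, lab in zip(points, labels)
    )[:k]
    votes = defaultdict(float)
    for d, lab in nearest:
        # distance weighting gives closer neighbours a larger say
        votes[lab] += 1 / (d + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)

pts = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
labs = ["a", "a", "a", "b", "b"]
print(knn_predict(pts, labs, (0.5, 0.5), k=3))  # a
print(knn_predict(pts, labs, (5.5, 5.0), k=3))  # b
```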

- What is Logistic Regression?
- How it works
- Logistic VS. Linear Regression
- Advantages / Disadvantages
- When to use it
- Implementation in Python

**Logistic regression** is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes). It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables. To represent the binary/categorical outcome, we use dummy variables.

You can also think of logistic regression as a special case of linear regression when the outcome variable is categorical, where we are using log of odds as the dependent variable. In simple words, it predicts the probability of occurrence of an event by fitting data to a **logit** function.

**Logistic regression **was developed by statistician **David Cox** in 1958. This binary logistic model is used to estimate the probability of a binary response based on one or more predictor (or independent) variables (features). It allows one to say that the presence of a risk factor increases the probability of a given outcome by a specific percentage.

Like all regression analyses, the **logistic regression** is predictive analysis. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.

**Application of Logistic Regression:** It’s being used in Healthcare, Social Sciences & various ML for advanced research & analytics.

Logistic Regression measures the relationship between the dependent variable (our label, what we want to predict) and one or more independent variables (our features) by estimating probabilities using its underlying logistic function.

These probabilities must then be transformed into binary values in order to actually make a prediction. This is the task of the logistic function, also called the sigmoid function. The sigmoid function is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits. These values between 0 and 1 will then be transformed into either 0 or 1 using a threshold classifier.
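A minimal sketch of the sigmoid function and the threshold classifier just described (the 0.5 threshold is the conventional default):

```python
import math

def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1 / (1 + math.exp(-z))

def classify(z, threshold=0.5):
    """Turn the sigmoid probability into a binary 0/1 prediction."""
    return 1 if sigmoid(z) >= threshold else 0

print(sigmoid(0))     # 0.5
print(classify(2.0))  # 1
print(classify(-2.0)) # 0
```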

The picture below illustrates the steps that logistic regression goes through to give you your desired output.

Below you can see how the logistic function (sigmoid function) looks like:

We want to maximize the likelihood that a random data point gets classified correctly, which is called Maximum Likelihood Estimation. Maximum Likelihood Estimation is a general approach to estimating parameters in statistical models. You can maximize the likelihood using different methods like an optimization algorithm.

Newton’s Method is such an algorithm and can be used to find the maximum (or minimum) of many different functions, including the likelihood function. Instead of Newton’s Method, you could also use Gradient Descent.

You may be asking yourself what the difference between logistic and linear regression is. Logistic regression gives you a discrete outcome, but linear regression gives a continuous outcome. A good example of a continuous outcome would be a model that predicts the value of a house. That value will always be different based on parameters like its size or location. A discrete outcome will always be one thing (you have cancer) or another (you have no cancer).

It is a widely used technique because it is very efficient, does not require too many computational resources, it’s highly interpretable, it doesn’t require input features to be scaled, it doesn’t require any tuning, it’s easy to regularize, and it outputs well-calibrated predicted probabilities.

Like linear regression, logistic regression does work better when you remove attributes that are unrelated to the output variable as well as attributes that are very similar (correlated) to each other. Therefore Feature Engineering plays an important role in regards to the performance of Logistic and also Linear Regression. Another advantage of Logistic Regression is that it is incredibly easy to implement and very efficient to train. I typically start with a Logistic Regression model as a benchmark and try using more complex algorithms from there on.

Because of its simplicity and the fact that it can be implemented relatively easy and quick, Logistic Regression is also a good baseline that you can use to measure the performance of other more complex Algorithms.

A disadvantage of it is that we can’t solve non-linear problems with logistic regression, since its decision surface is linear. Just take a look at the example below, which has 2 binary features from 2 examples.

It is clearly visible that we can’t draw a line that separates these 2 classes without a huge error. To use a simple decision tree would be a much better choice.

Logistic Regression is also not one of the most powerful algorithms out there and can be easily outperformed by more complex ones. Another disadvantage is its high reliance on a proper presentation of your data. This means that logistic regression is not a useful tool unless you have already identified all the important independent variables. Since its outcome is discrete, Logistic Regression can only predict a categorical outcome. It is also an Algorithm that is known for its vulnerability to overfitting.

Like I already mentioned, Logistic Regression separates your input into two “regions” by a linear boundary, one for each class. Therefore it is required that your data is linearly separable, like the data points in the image below:

In other words: You should think about using logistic regression when your Y variable takes on only two values (e.g when you are facing a classification problem). Note that you could also use Logistic Regression for multiclass classification, which will be discussed in the next section.

Importing the Essential libraries

Importing the Dataset

Splitting the dataset into the Training set and Test set

Feature Scaling

Fitting logistic regression to the training set

Predicting the Test Set Result

Making the Confusion Matrix

Output of Confusion Matrix

Tips – the confusion matrix means that 65 + 24 = 89 are correct predictions and 8 + 3 = 11 are incorrect predictions.

Visualizing the Training set Results

Output of the Training Set

Visualizing the Test Set

Output of the Test Set
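Since the code screenshots are not reproduced here, the steps listed above can be sketched with scikit-learn. The dataset below is synthetic and the split/scaling choices are placeholders, not the data used in the post:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

# Synthetic binary-classification data standing in for the post's dataset
rng = np.random.RandomState(0)
X = rng.randn(500, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Feature Scaling
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Fitting logistic regression to the training set
clf = LogisticRegression().fit(X_train, y_train)

# Predicting the Test Set Result and making the Confusion Matrix
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))
```

The sum of the diagonal of `cm` divided by the total gives the accuracy, exactly as in the tip above.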

The post What Is Classification appeared first on StepUp Analytics.


The use of various languages for verbal communication is an integral part of human interaction. Now, with the advent of Artificial Intelligence and Machine Learning, we are trying to give machines this capability of communication as well.

This article will help in the understanding of Natural Language Processing which is a collection of different principles used for performing various tasks related to NLP.

**Through this article, we will look at the following topics:**

- What is Natural Language Processing?
- Tasks performed using NLP.
- Essential NLP Libraries
- Fundamentals of NLP
- Data Retrieval through Web Scraping.
- Preprocessing the text.
- Feature Engineering on text data.
- Modeling the data and/or Making predictions.

**What is Natural Language Processing?**

**Natural Language Processing** is a field created by amalgamating computer science and artificial intelligence. NLP is concerned with the interactions between computers and human natural languages.

NLP encompasses varied tasks which are aiming for systematic processes. These processes are used to analyze, understand and derive useful data i.e. knowledge from text data which is available from numerous sectors.

NLP is especially used when we have unstructured data like text, image, and videos. To work with this data, NLP is used which provides tools, techniques, and algorithms to process this natural – language data.

**Essential libraries used for NLP in python**

With python, we have the advantage of performing some of the complex tasks using libraries. Similarly, there are multiple libraries which facilitate various operations in NLP.

Let’s have a look at the following libraries:-

**Natural Language Toolkit**(NLTK)

NLTK is the foremost platform for constructing Python programs that work with linguistic data. NLTK is an open-source, community-driven project.

**Scikit-learn**

This library is not just used for NLP but it’s applied widely in Machine learning.

**TextBlob**

TextBlob provides an easy-to-use API for common NLP tasks.

**spaCy**

Using spacy, we can implement concepts of NLP by using Python and Cython. It has some excellent capabilities for named entity recognition.

**Gensim**

Gensim is specifically used for topic modelling when we deal with text documents.

**Stanford CoreNLP**

One of the most famous NLP toolkits, developed in Java by the Stanford NLP Group.

**Fundamentals of NLP**

**Getting data through various sources**

Data is the primary requirement for any kind of operation in NLP. We therefore have two options: either get ready-made datasets from various sources related to different domains, or scrape data from a website, which is known as Web Scraping.

**Preprocessing the text**

The data obtained from different sources is full of noise i.e. errors which will affect the precision of the result.

So it is always recommended to perform cleaning and standardization of the text. This will make the text noise free and ready for further analysis.

There are numerous methods which are opted to perform preprocessing of the text. Some of them are as follows:

**Noise Removal.** Any text present in the document which is impertinent to the context of the data and to the final output can be termed as noise.

**For example**: stopwords, URLs, hashtags and punctuation marks are common sources of noise.

**Text Normalization.** Many times we encounter a single word with different representations, and this is considered another form of noise. For example: sing, singer, singing, sings, sang are multiple variations of the word **sing**.

To deal with this, we use two methods which are **Stemming** and **Lemmatization**.
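To illustrate the idea, here is a toy suffix-stripper in Python. It is not a real stemmer (in practice one would use nltk's PorterStemmer or WordNetLemmatizer); the suffix rules here are invented for demonstration, and the irregular form “sang” shows exactly where stemming fails and lemmatization is needed:

```python
def crude_stem(word):
    """Toy stemming: strip a few common suffixes, keeping at least 3 letters."""
    for suffix in ("ing", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

print([crude_stem(w) for w in ["sing", "singer", "singing", "sings"]])
# ['sing', 'sing', 'sing', 'sing']
print(crude_stem("sang"))  # 'sang' -- suffix rules can't map this to 'sing'
```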

**Feature Engineering on text data.**

After completing the preprocessing of the data, we have to perform feature engineering, which will help in further predictions. On the basis of the features used, we decide which model to build.

Again there are various ways through which we can extract features from the data. Some of those methods are as follows:

**Syntactic Parsing:** In parsing, we analyse the words in a sentence for grammar and deduce the relationships among them. Dependency parsing and part-of-speech tagging are the preferred methods for analysing syntax.

**Dependency Trees:** Sentences are formed by joining words together. The relationships the words have among themselves are depicted through dependency grammar.

**Parts of Speech Tagging:** Every word used in a sentence carries a part-of-speech (POS) tag (noun, verb, adjective, adverb). Through POS tags we are able to understand how and for what a word is used.

**Entity Extraction.** Entities are of paramount importance when we consider any sentence; these entities are noun phrases, verb phrases or both. The various algorithms used for entity extraction rely on techniques like rule-based parsing, dictionary lookups, POS tagging, and dependency parsing.

There are many known ways which are used for this purpose but Topic Modelling and Named Entity Recognition are the most widely used methods.

**Named Entity Recognition:** Through named entity recognition we detect named entities such as person names, locations, and names of companies and organizations in sentences. For example:

Sentence: Tim Cook, the CEO of Apple Inc. was present in the launch of new iPhone at New York.

*Named Entities: (“person”: “Tim Cook”), (“org”: “Apple Inc.”), (“location”: “New York”)*

**Topic Modelling:** In topic modelling, our main aim is to identify the topics in a text document. Here, topics are defined as frequent patterns of similar terms occurring together in a corpus.

**Statistical Features.** All of the above-mentioned ways are used to convert text data into numerical data so that the algorithms can understand it. But there are also ways to quantify the text directly into numbers.

One such method is **Term Frequency – Inverse Document Frequency** (TF-IDF).

This is a weighting scheme used in information retrieval. Through it, we convert text into vector models based on the occurrence of words in documents.

**Term Frequency** (TF) – This tells the number of times a term has been present in the document.

**Inverse Document Frequency** (IDF) – IDF is used to assign more weight to terms that are rare across the corpus and less weight to terms that occur in many documents.
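The weighting can be sketched in a few lines of Python. The three-document corpus is invented; real implementations (e.g. scikit-learn's TfidfVectorizer) use smoothed variants of this formula:

```python
import math

def tf_idf(term, doc, corpus):
    """TF = count of term in doc / doc length;
    IDF = log(N / number of docs containing term): rarer terms weigh more."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf

docs = [["the", "cat", "sat"],
        ["the", "dog", "sat"],
        ["the", "cat", "ran"]]

print(round(tf_idf("cat", docs[0], docs), 3))  # 0.135
print(tf_idf("the", docs[0], docs))            # 0.0 -- appears in every document
```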

**Tasks performed using NLP.**

Once we have completed the preprocessing and feature engineering of the text data, we can run many different experiments with it and get different results.

As already mentioned, NLP comprises disparate tasks through which we have been able to make big strides in this linguistic interaction with machines.

Let’s have a look at the following tasks:

**Speech Recognition**

This application of NLP has made actual spoken communication between us and machines possible. Some of the most famous examples are virtual assistants such as Siri, Alexa and Google Assistant.

**Automatic Summarization**

With automatic summarization, long documents are condensed into shorter summaries that retain the key information.

**Machine Translation**

Using this, we have been capable of converting given text from one language to another desired language. One of the most popular examples is Google Translate.

**Named Entity Recognition**

In any sentence, there are entities such as names of people, places and organizations; NER detects and classifies them.

**Sentiment Analysis**

One of the most popular applications of NLP is analysing the sentiments of large and unique datasets, varying from surveys to reviews. Through sentiment analysis, we aim to understand the views expressed in the text.

These views are either classified as positive, negative or neutral. This classification is done by quantifying the polarity score.

The main requirement for sentiment analysis is that the text should be subjective, i.e. there should be sentences/phrases expressing emotions, opinions or views. Thus, purely objective text will not provide the desired results. This application is extensively used for social media analysis in any field, for gauging the public’s response in elections, reactions towards movies, etc.

**Topic Segmentation/Topic Modelling**

Each and every document/article is comprised of various topics. The process of identifying the topics in text corpus is known as Topic Modelling/Topic Segmentation.

For deriving topics, different algorithms are used which mainly provides results by looking at the frequencies of all the words.

We have already discussed most of the fundamentals of NLP. Now it’s time to use them to implement a real-world scenario.

Here we will be scraping the news articles from ** Inshorts**, a website which is built to provide short 60-words news articles on wide range of topics.

Here in this article, we will be using text data from news articles on sports, science and technology.

**Step: 1 – Scraping the data from news articles.**

First, we have to create a web scraper which will extract the data from the news articles. For this, the necessary libraries (the dependencies) are imported.

For this extraction of data we require **requests **and **Beautiful soup **libraries.

Now we have to use the requests library to access and fetch the HTML content from the web pages.

Through the **build_dataset()** function we fetch the news headline, article text and category for each article.

We can see the dataset of news articles which is created and to check the number of news articles fetched, we can use the code mentioned below.

It can be seen that 25 articles are extracted of each of the three categories.

**Step: 2 – Preprocessing the text**

Now to make the text data competent for analysis, we will have to perform numerous steps for preprocessing the text. For performing the preprocessing we will be requiring the following libraries and thus, we’ll have to import them.

Here the text is tokenized, i.e. the words are converted into tokens. Moreover, the stopwords ‘no’ and ‘not’ are removed from the stopword list (i.e. retained in the text), as they are of significance for sentiment analysis.

**Removing HTML Tags: **The text data obtained through web scraping contains noisy data. Especially, HTML tags which are not adding any value to the understanding of the data. Thus, we should remove it.

So the above code helps us to remove the unwanted HTML tags and get the text from the document.

**Expansion of Contracted Words.** In our day-to-day spoken language, we use a huge number of shortened forms of words. These shortened words often will not make sense for our analysis. Thus, we have to expand these short forms into their original representations.

To understand the various contractions, there is a contraction.py file in my repository which is used here.

**For example**: “We’ll” becomes “we will”.

**Removal of Special Characters: **All the non-alphanumeric characters are said to be the special characters. On many occasions we even remove the numeric characters from the text data, but this depends on the problem we have on our hands.

For removing special characters we use regular expressions which have the ability to remove these symbols pretty easily.

**Lemmatization**

Lemmatization is used to get the original form of the word, i.e. the root form of the word. This removes the ambiguity between the various forms of the same word. Here we have used **nltk** for lemmatization of words.

**Stemming**: Stemming is similar to lemmatization, but here we focus on the word stem and not the root. The main aim is to remove the suffixes (“ing”, “ed”) attached to a word.

The point which we should note is that after stemming many times we obtain a word which is not lexicographically correct. But stemming helps in standardization of text.

Again, for stemming we have the inbuilt functions of nltk, such as the PorterStemmer.

**Removal of Stopwords:** The most frequent words used in our communication, like **a, an, the, and**, are of little significance. In particular, articles, conjunctions and prepositions contribute little to the meaning of the text. So it is recommended to remove these stopwords.

**Text Normalizer:** Depending on our dataset, we could perform many more preprocessing steps, but here we will end with these. Now all the above techniques will be combined under one function and then applied to our dataset.

Now it’s time to use this text normalizer on our dataset.

This is how the text normalizer works for cleaning the data. The clean text and the full text have clear differences, which can be observed in the output.

**Step: 3 – Sentiment Analysis: **

For sentiment analysis of the news articles, note that our data is unlabelled.

To manage this unlabelled data we will use lexicons. Lexicon is a dictionary, vocabulary, or book of words. Here we will be using lexicons for sentiment analysis.

There are different types of lexicons, and one of them is the AFINN lexicon: a list of English words, each rated with an integer valence score between -5 (most negative) and +5 (most positive).
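Lexicon-based scoring can be sketched as follows. The miniature lexicon below is hypothetical (the real AFINN list contains thousands of scored words), as is the simple classification around zero:

```python
# Toy AFINN-style lexicon: word -> valence score (entries are made up)
lexicon = {"win": 4, "great": 3, "good": 3, "bad": -3, "loss": -3, "crisis": -3}

def sentiment_score(text):
    """Sum the valence of every known word in the text."""
    return sum(lexicon.get(word, 0) for word in text.lower().split())

def sentiment_label(text):
    """Classify the total score into positive / negative / neutral."""
    score = sentiment_score(text)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment_label("A great win for the team"))  # positive
print(sentiment_label("Markets fall amid crisis"))  # negative
```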

The above created corpus is used for generating sentiment scores, which are then used to classify the articles into three categories.

Let’s look at what these sentiment scores of the three categories of news articles are conveying by using **pandas **library.

We can deduce from the results that the average sentiment of sports news articles is positive, whereas technology news articles have a negative average sentiment.

Now we will be creating visualizations of these news articles and then analyse the results.

Here we are creating strip plot and box plots using the Seaborn library.

Both of these visualizations simplify the results obtained. We can see that **technology** and **science** have news articles spread over both the negative and positive range.

We can even create a chart depicting the count of each sentiment category per news category.

This chart corroborates the fact that technology has the largest number of negative-sentiment articles.

**Sentiment Analysis with TextBlob:**

Another library which is open-source and extensively used for NLP tasks is TextBlob. Now we’ll perform sentiment analysis with this library.

Again we are generating polarity of sentiments and then classifying them into the three categories.

We’ll repeat the process of analysing the sentiments.

Here we can see that the average sentiment score is positive for all three categories, with sports news articles having the highest average. Now we will visualize the results using the Seaborn library.

Here technology has the largest number of positive-sentiment news articles, whereas science news articles contain the largest number of negative-sentiment articles.

The Jupyter notebook consisting of the code and dataset for this article can be found here.

The post Understanding Natural Language Processing appeared first on StepUp Analytics.

The post Multivariate Adaptive Regression Splines appeared first on StepUp Analytics.

**What is a non-parametric regression?**

In most popular regression techniques, like generalized linear regression (GLM) and multiple linear regression (LM), the dependent variable is hypothesized to depend linearly on the predictor variables, and the values of the dependent variable are predicted based on the values of the independent predictor variables. For example:

**House_Prices = Constant + 2.5*No_of_rooms + 2.0*No_of_floors +3.2*Total_Area**

Here the independent variables are the number of rooms, the number of floors and the total area of the house. The dependent variable is the house price. The numerical values which multiply the independent variables are the **regression coefficients**. The higher the value of a regression coefficient, the higher the influence of that independent variable over the dependent variable. If the variables are scaled properly, then direct comparisons between these coefficients are more relevant and useful.

The example given above is a parametric regression, which assumes a relationship between the dependent and independent variables prior to the regression modeling. But **nonparametric regression** does not make such assumptions about the relationship between the variables. Instead, it constructs the relation from coefficients and so-called **basis functions** (**Wiki Click**) that are determined entirely from the data.

The MARSplines algorithm performs such a regression while also searching for **nonlinearities** in the data, which helps maximize the predictive accuracy of the model. So the MARSplines technique has taken a step forward, producing successful results where the relationship between the predictors and the dependent variables is difficult to establish.

The MARSplines model equation is given below:

y = B0 + Σ (m = 1 to M) Bm · hm(X)

In the given equation, the output vector y is predicted as a function of the predictor variables X. The constant term is B0, Bm is the coefficient associated with the m-th term, and hm(X) is a function from a set of one or more basis functions.

**How does the MARSplines algorithm work?**

The MARSplines method partitions the input data into different parts, each with its own regression equation. The partitioning of the data happens with the help of **hinge** or **rectifier functions** (**Wiki Click**), which take the form:

max(0, x − t) and max(0, t − x)

where t is a constant, also called a **knot**. This knot helps to model a nonlinear regression.
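The hinge pair and how basis terms combine into a piecewise-linear model can be sketched as follows; the coefficients and the knot at t = 3 are invented for illustration:

```python
def hinge(x, t):
    """The mirrored pair of hinge functions with knot t."""
    return max(0.0, x - t), max(0.0, t - x)

def mars_like(x):
    """Toy MARS-style model: y = 2 + 1.5*max(0, x-3) - 0.5*max(0, 3-x).
    Linear on each side of the knot t = 3, with different slopes."""
    return 2 + 1.5 * max(0.0, x - 3) - 0.5 * max(0.0, 3 - x)

print(hinge(5.0, 3.0))  # (2.0, 0.0)
print(mars_like(1))     # 1.0  (left of the knot)
print(mars_like(5))     # 5.0  (right of the knot)
```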

However, this diagram shows a single hinge function in action to attain nonlinearity; in reality, a large number of hinge functions are combined to model complex nonlinear relationships.

Now the model is built in two parts:

1. FORWARD PASS

2. BACKWARD PASS

**Forward Pass**

In this step we first build the model with just the intercept term. After that it starts adding basis functions in pairs: at each step it adds the pair of basis functions that reduces the prediction error the most, and it continues until the improvement becomes negligible or a maximum number of terms is reached. The two basis functions in a pair are mirror images of each other, i.e. the same hinge function reflected about its knot. Each new basis function added to the model consists of a term already in the model multiplied by a new hinge function. Because we just add the locally best basis functions one after another, this is known as a **greedy algorithm**.

**Backward Pass**

Backward Pass is a method of removing those basis functions from the model which are least significant. In each step it deletes the term whose removal causes the smallest increase in error, typically judged by generalized cross-validation (GCV), until the best submodel is found.

Nonparametric models exhibit a high degree of flexibility that may ultimately result in overfitting of the model. This high degree of flexibility leads the model to compromise its accuracy when it is presented with a new dataset. To combat this problem, the backward pass, also known as the "**pruning pass**", uses the pruning technique to limit the complexity of the model by reducing the number of its basis functions.

**Implementation of MARSplines in R**

The MARSplines algorithm is available in the R package **earth**, and we install it with:

**install.packages("earth")**

Now we load the package to use the function **earth**

**library (earth)**

I have used a dataset known as **Boston**, which is present in the **MASS** package, to show you a comparison between **MARSplines** and other regression and penalized regression techniques.

**library(MASS)**

**data <- Boston**

Splitting the dataset in two parts

**train <- data [1:400,]**

**test <- data [401:506,]**

We fit the model and save the model in** Fit**

**Fit <- earth (medv ~. , data = train )**

We call the summary of the model to see the values of the parameters

**summary (Fit)**

To see the importance of the input variables we have to use the function **evimp()**

**evimp (Fit)**

Read more details about **GCV** (generalized cross-validation), the criterion **earth** uses to select the model size.

We predict the values of the house price and save them in **Predictions**

**Predictions <- predict(Fit, test)**

Lastly, to see the accuracy of the model for the test dataset, we use the function **rmse()** from the **Metrics** package

**library(Metrics)**

**rmse (test$medv, Predictions)**

**CONCLUSION**

From the RMSE value it can be concluded that the MARSplines algorithm works very well on regression problems. We can also tell that the model has **worked better than linear regression and other penalized regression techniques like ridge, lasso and elastic net**.

[Note: Compare the accuracy of MARS with the accuracies of the above techniques from my previous article: (https://stepupanalytics.com/lasso-and-elastic-net-regression/ )]

You can also use the MARSplines algorithm with **mars()** from the ** mda** package : (https://cran.r-project.org/web/packages/mda/mda.pdf )

**NOTE:**

Two important features of MARSplines algorithm :-

- It can be **applied to multiple dependent variables**. The algorithm determines a common set of basis functions in the predictors, but estimates different coefficients for each dependent variable.
- Because MARSplines can handle multiple dependent variables, it is easy to apply the algorithm to classification problems as well.

The post Multivariate Adaptive Regression Splines appeared first on StepUp Analytics.


Computer security or IT security is the protection of computer systems from theft or damage to their hardware, software or electronic data, as well as from disruption or misdirection of the services they provide. The field is of growing importance due to increasing reliance on computer systems, the Internet and wireless networks such as Bluetooth and Wi-Fi, and due to the growth of “smart” devices, including smartphones, televisions and the various tiny devices that constitute the Internet of Things. Due to its complexity, both in terms of politics and technology, it is also one of the major challenges of the contemporary world. (Source: Wikipedia)

The potential threats of the huge cyberspace are not hidden from anyone, and protecting our cyberspace is still a hot topic of research. In computers and computer networks, an attack is an attempt to expose, alter, disable, destroy, steal or gain unauthorized access to, or make unauthorized use of, an asset. There are two terms that are used very frequently while talking about cybersecurity: Intrusion Detection Systems (IDS) and Intrusion Prevention Systems (IPS).

An IDS detects an attack that has happened; an IPS prevents such an attack. It is easier to detect an attack than to completely prevent one. Machine learning can be used to increase the reliability of cybersecurity methods. In particular, we will talk about how machine learning can be used in Intrusion Detection Systems.

IDS can be classified into two main categories based on operational logic:

- **Signature-based IDS**
- **Anomaly-based IDS**

**Signature-based** IDS works with definitions of known vulnerabilities that are considered attacks. Its operating logic is based on a basic classification problem: incoming events are compared with the signatures, and if a match is found an alert is raised; otherwise, no malicious event is reported. It has low flexibility and uses low-level machine learning structures. This system has very high accuracy for known attacks but fails in the case of new (zero-day) attacks.
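As a concrete illustration of this matching logic, here is a toy Python sketch of a signature check (the signature strings and function names are invented for this example; a real IDS like Snort matches packet contents against far richer rules):

```python
# Toy signature-based IDS: compare each incoming event against known
# attack signatures and alert only on a pattern match.
# The signatures below are illustrative, not real rule content.
SIGNATURES = {
    "sql_injection": "' OR '1'='1",
    "path_traversal": "../../etc/passwd",
}

def inspect(payload: str):
    """Return the names of matching signatures (empty list = no alert)."""
    return [name for name, pattern in SIGNATURES.items() if pattern in payload]

print(inspect("GET /index.html"))                     # no match -> benign
print(inspect("GET /download?file=../../etc/passwd")) # matches path_traversal
```

The sketch also shows the zero-day weakness: any attack whose payload is not in `SIGNATURES` passes through silently.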

**Anomaly-based** IDS checks the behavior of the traffic and whenever there is an anomaly in the usual behavior, an alarm is raised. It has high flexibility and it uses high-level machine learning structures.

A lot of research has been going on in this area using both supervised and unsupervised algorithms. For academic purposes, there are many datasets available on the web for public use. The most popular is KDD99, a well-known benchmark in the research of intrusion detection techniques. A lot of work is going on to improve intrusion detection strategies, and the data used for training and testing the detection model is of equal concern, because better data quality can improve offline intrusion detection.

The supervised approach usually deals with known attacks: it is the machine learning form of signature-based IDS. The dataset contains definitions of various malicious activities, derived from network flow data, and the system works with labeled events that occurred in the network. When a model such as an artificial neural network encounters an intrusion, it looks for the definition of that intrusion in the dataset. If a definition is found, an alarm is raised; if not, the intrusion is ignored.

This approach has very high accuracy in recognizing well-known malicious activities, and false alarm rates are very low. Bayesian networks along with Support Vector Machines (SVM) are used to detect attacks in the supervised approach. Artificially intelligent antivirus systems of this kind are used in applications that require high security and very low false alarm rates, such as computers containing military information or computers operating missiles. Many institutions in the USA have turned to the supervised approach for the security of documents and critical information.

However, this approach fails in the case of zero-day attacks.

The unsupervised approach to detecting cyber attacks is used when the dataset doesn't contain any definitions: the class of the attack, its features, and everything else about it are unknown. This approach assumes that a huge change in the network flow happens only when a malicious agent has entered the system. The behavior of the network is monitored continuously; a threshold is set, and whenever the anomaly crosses this threshold, an alarm is raised.

In this approach the neural network works on the network data itself rather than on any class definitions, so it is very effective at detecting zero-day attacks. However, if the attacker crafts the traffic intelligently, the detector can be evaded. Moreover, it creates a lot of false alarms. This is a major issue, and research is ongoing to improve these algorithms.
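The thresholding idea behind anomaly-based detection can be sketched in a few lines of Python (the baseline numbers and the 3-sigma threshold are made-up illustrative choices, not values from any real deployment):

```python
from statistics import mean, stdev

# Toy anomaly-based IDS: learn the "usual" traffic volume from a baseline
# window, then flag any observation whose z-score crosses a fixed threshold.
baseline = [100, 104, 98, 101, 99, 103, 97, 102]   # normal requests/minute (illustrative)
mu, sigma = mean(baseline), stdev(baseline)
THRESHOLD = 3.0  # alarm when more than 3 standard deviations from normal

def is_anomaly(observed):
    return abs(observed - mu) / sigma > THRESHOLD

print(is_anomaly(102))   # within normal behaviour -> no alarm
print(is_anomaly(450))   # likely flood or malicious burst -> alarm
```

The sketch also makes the two weaknesses visible: an attacker who keeps traffic just under the threshold evades detection, and any benign but unusual spike triggers a false alarm.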

Both techniques have advantages and disadvantages. To combine the advantages efficiently and eliminate the disadvantages, hybrid approaches have been developed: one part of the detection mechanism works with a supervised algorithm, and another part works with an unsupervised algorithm. In recent years most research has focused on hybrid detection approaches.

Snort is a free and open source network intrusion prevention system (NIPS) and network intrusion detection system (NIDS) and used all around the world. Snort’s open source network-based intrusion detection system (NIDS) has the ability to perform real-time traffic analysis and packet logging on Internet Protocol (IP) networks. Snort performs protocol analysis, content searching, and matching. These basic services have many purposes including application-aware triggered quality of service, to de-prioritize bulk traffic when latency-sensitive applications are in use.

Snort can be configured in three main modes: sniffer, packet logger, and network intrusion detection. In sniffer mode, the program will read network packets and display them on the console. In packet logger mode, the program will record packets to the disk. In the intrusion detection mode, the program will monitor network traffic and analyze it against a rule set defined by the user. The program will then perform a specific action based on what has been identified. (Source: Wikipedia)
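For a flavor of what such a rule set looks like, a minimal Snort rule might read as follows (the message text, network range and SID are invented for this example; local rules conventionally use SIDs of 1,000,000 and above):

```
alert tcp any any -> 192.168.1.0/24 80 (msg:"Possible directory traversal attempt"; content:"../"; sid:1000001; rev:1;)
```

The rule names an action (alert), a protocol, source and destination addresses and ports, and a set of options: here Snort raises an alert whenever a TCP packet to port 80 on the given network contains the string "../".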

Cyber attack detection is like a game between the attacker and the detection system, and there is no ultimate winner. Whenever an attack is detected, the attacker comes up with a hacking technique that can evade detection; and whenever an attack evades detection, new and more efficient detection algorithms are developed. It is a never-ending cycle. Machine learning has improved detection algorithms to a great extent. However, intelligent hackers keep developing attacks that evade them by exploiting loopholes, and intense research is going on to remove these loopholes and come up with better algorithms.

The post Cyber Security Using Machine Learning: SNORT appeared first on StepUp Analytics.


- What is Reinforcement learning in simple words?
- The components of Reinforcement Learning problem.
- Distinguishing between Reinforcement learning, Supervised and Unsupervised learning.
- Algorithms used for implementing RL.
- Practical implementation of Reinforcement learning.
- Ways used for learning.
- The disadvantage of Reinforcement Learning.
- Applications of Reinforcement Learning around us.
- Real world implementation of Reinforcement Learning.

Reinforcement Learning is learning the best actions on the basis of rewards and punishments. But when we wear our technical goggles, Reinforcement Learning is defined using three basic concepts, i.e. states, actions, and rewards.

Here the “**state**” defines a situation in which an agent is present; the agent performs some “**actions**”, and based upon these actions the agent receives either rewards or punishments.

When we consider the example of the dog, there we have the owner of the dog and the “**dog**” (**agent**) itself. When the owner is in the garden with the dog and throws a ball, this throwing of the ball is the “**state**” for the **agent**, and the **dog** running after the ball is the “**action**”.

The result will be appreciation or food for the dog from the owner, which is the “**reward**” for the action; if the dog does not go after the ball and takes some alternate action instead, it may get some “**punishment**”. Therefore, this is what Reinforcement Learning is all about. Next, we'll understand the terminology that Reinforcement Learning comprises.

Now for each and every Reinforcement Learning problem, there are some predefined components which help in better representation and understanding of the problem. The following are the components:-

**Agent**: The agent takes the actions; as mentioned earlier in our example, the dog is the **agent**.

**Action (A)**: The agent has a set of actions **A** from which it selects which action to perform, just like the dog deciding whether to go after the ball, just look at the ball, or jump in place.

**Discount Factor: **The **discount factor** is multiplied with the future rewards discovered by the agent, making future rewards less valuable than immediate rewards. The lower the discount factor, the more insignificant future rewards become, pushing the agent toward short-term goals, and vice versa.

**Environment: **It is the surroundings of the agent in which it moves. In the dog example, **the environment **consists of the owner and the garden in which the dog is present. It is the **environment **which gives the agent its rewards as an output based upon the agent’s current state and action as inputs.

**State: **A state is an immediate situation in which the agent finds itself in relation to other important things in the surroundings like tools, obstacles, enemies and prizes/rewards. Here the dog finds itself in a new state each time the owner throws the ball.

**Reward(R): **The **reward** is the output which is received by the agent in response to the actions of the agent. For example, the dog receives **dog food** as a **reward** if the dog (**agent**) brings back the ball otherwise it receives scolding as a **punishment** if it does not wish to do so.

**Policy: **The policy is the strategy which the agent uses to determine its actions on the basis of the current state. Basically, the policy maps states to actions, i.e. it picks the actions that provide the maximum rewards in the given states. Talking about the dog example, once the dog comes to know that dog food will be given as a reward for bringing back the ball, it will create its own policy to reap maximum rewards.

**Markov Decision Processes (MDPs) **are mathematical frameworks to describe an environment in reinforcement learning, and almost all RL problems can be formalized using MDPs.

Basically, **MDPs **consist of a set of finite environment states S, a set of possible actions A(s) in each state, a real-valued reward function R(s), and a transition model.

All those who possess some basic knowledge of Artificial Intelligence would be well aware of the terms Supervised and Unsupervised learning. Similarly, Reinforcement learning has been the buzzword in the field of AI and its implementations have gained huge popularity. **For example**, The very famous **AlphaGo** was developed using Reinforcement learning by **Google Deepmind**, which went on to defeat the World Champion “**Lee Sedol”** of the Game **Go**.

Now you must be wondering why supervised learning or unsupervised learning was not used. So let’s look at the areas where Reinforcement learning is better as compared to the other two methods.

Supervised Learning gets its name from the usage of an external supervisor who is aware of the environment and shares that knowledge with the agent for accomplishing the task. In general, supervised learning is like learning from tasks which have already been completed, with the agent obtaining its experience from them. But in some cases there are no completed tasks from which experience can be gained, and thus we cannot have any supervisor.

Since the game of Go has move counts running into the billions, we cannot create such a knowledge repository; the only option left is playing more and more games to gain experience and extract knowledge from them.

So in both supervised learning and reinforcement learning we are mapping between input and output, but in reinforcement learning the reward function acts as the feedback or experience, in contrast to the labeled examples of supervised learning.

In unsupervised learning, there is no concept of mapping between input and output, unlike reinforcement learning. In unsupervised learning, our main aim is to find the hidden patterns. For example, most of the recommendation systems like movie recommendation, news articles use unsupervised learning for the same. So in this, we are building a knowledge graph on the basis of constant feedback which the customer provides by liking particular movies/articles and then similar things are recommended.

Supervised and Unsupervised Machine Learning Algorithms **Read**

Reinforcement learning along with its fundamental concepts needs to be implemented practically and for that, we use the following algorithms. Let’s have a look at those algorithms:

Q-learning is the most used reinforcement learning algorithm. By using this algorithm, the agent learns the quality (**Q value**) of each action, and hence a **policy**, based on how much reward the environment returns.

Q-learning uses a table to store a Q value for every state-action pair of the environment.
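To make this concrete, here is a toy tabular Q-learning sketch in Python on a made-up one-dimensional corridor environment (the environment, names and hyperparameter values are all illustrative, not from the article):

```python
import random

# Tabular Q-learning on a toy 1-D corridor: states 0..4, reward 1 at state 4.
# Actions: 0 = left, 1 = right.
N_STATES, GOAL = 5, 4
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

Q = [[0.0, 0.0] for _ in range(N_STATES)]   # the Q table: one row per state

def step(state, action):
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def choose(s):
    # epsilon-greedy policy with random tie-breaking
    if random.random() < EPSILON or Q[s][0] == Q[s][1]:
        return random.randrange(2)
    return 0 if Q[s][0] > Q[s][1] else 1

random.seed(0)
for _ in range(300):                  # episodes
    s = 0
    for _ in range(100):              # step limit per episode
        a = choose(s)
        s2, r, done = step(s, a)
        # Q-learning update: bootstrap from the best action in the next state
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s = s2
        if done:
            break

print([round(max(q), 2) for q in Q])  # values decay by GAMMA per step from the goal
```

The greedy `max(Q[s2])` in the update is what distinguishes Q-learning from SARSA, which would instead use the Q value of the action the policy actually takes next.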

SARSA resembles Q-learning to a large extent. The only difference between the two is that SARSA learns the Q-value based on the action actually performed by the current policy, as opposed to Q-learning's use of the greedy policy.

Now we will have a look at one of the basic implementations of Reinforcement Learning using the **OpenAI Gym library**. Gym lets us compare different algorithms of Reinforcement Learning.

Gym provides us with a variety of test problems, i.e. **environments**, all of which can be used to learn more about our reinforcement learning algorithms.

Before starting to work with Gym, we need to install gym using pip:

pip install gym

Directly from iPython notebook.

!pip install gym

After this, we are ready to start.

First, we are importing the gym library which we had installed earlier. Then we are using one of the inbuilt environments of Gym i.e. **CartPole**. This environment will display a pole trying to balance on a cart which is moving left and right.

**Step 2:**

Here we have created a function which takes care of the action to be taken by the agent on the basis of the state and environment. Using this function, our main aim is to keep the pole on the cart balanced so that it does not fall down. So if the pole leans more than a given angle to one side, we return 0 as the action.

In the totals list, we store the rewards which will be collected in each episode.

Here, in this loop, we calculate the rewards obtained in each episode. Each episode starts with an observation value which has been reset using the reset() function.

Here we are running an instance of the CartPole environment for 1000 timesteps, fetching the environment at each step. A small popup window displays the cart-pole via the render() function. Along with this, we decide the current action based upon the observation produced by the agent's previous actions by calling the basic_policy() function.

Next we use the step() function, which returns four values: **observation** (object type: the observation of the environment), **reward** (float type: the amount of reward received from the previous action), **done** (Boolean type: whether the episode has terminated or not), and **info** (dictionary type: information used for debugging and for learning about the environment). So in this loop, at each timestep the **agent chooses an action** and the **environment** returns an **observation** and a **reward**.

Lastly, once the done variable returns the value “true”, we come out of the loop and append the episode_rewards value to totals[].

Finally, we print the totals[] list, which holds the reward values, along with the maximum reward obtained. Most importantly, we use the close() function to close the popup window; otherwise, the program may crash.
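Since the article's code screenshots are not reproduced here, the following self-contained Python sketch mirrors the loop described above, with a toy stand-in class replacing Gym's real CartPole environment (in actual use you would call gym.make("CartPole-v0") instead; the stub's fake physics are invented purely for illustration):

```python
import random

class StubCartPole:
    """Minimal stand-in for Gym's CartPole: same reset/step interface,
    but with a fake 'pole angle' instead of real physics."""
    def reset(self):
        self.angle = random.uniform(-0.05, 0.05)
        self.steps = 0
        return [0.0, 0.0, self.angle, 0.0]   # [position, velocity, angle, tip velocity]

    def step(self, action):
        # pushing right (1) tilts the pole one way, left (0) the other
        self.angle += 0.02 if action == 1 else -0.02
        self.angle += random.uniform(-0.01, 0.01)
        self.steps += 1
        done = abs(self.angle) > 0.2 or self.steps >= 200
        return [0.0, 0.0, self.angle, 0.0], 1.0, done, {}

def basic_policy(obs):
    # push the cart in the direction that counteracts the pole's lean
    return 0 if obs[2] > 0 else 1

random.seed(1)
env = StubCartPole()
totals = []
for episode in range(20):
    obs = env.reset()
    episode_rewards = 0.0
    done = False
    while not done:
        obs, reward, done, info = env.step(basic_policy(obs))
        episode_rewards += reward
    totals.append(episode_rewards)

print(max(totals), sum(totals) / len(totals))
```

The loop structure (reset, step until done, accumulate reward, append to totals) is exactly the one the article walks through; only the environment is a stub.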

The code of this implementation can be found here

To implement Reinforcement learning we need a predefined method of learning, i.e. how the agent will understand which action should be taken to maximize the rewards.

For the above-mentioned reason, we have two methods used for learning which are as follows:-

In this method, the agent completes the episode (i.e. reaches a “terminal state”) and then looks at the **total rewards to see how well it has performed**. Here in the Monte Carlo method, the **rewards collection is done at the end of the episode** and then on the basis of the result, **the maximum expected future reward is calculated**.

**For example**: If we understand through the **dog example, **here the agent i.e. dog will be using **Monte Carlo approach** and completing the action of bringing the thrown ball back and then analyze the rewards which it received. On the basis of rewards, the dog will decide which actions should be performed in near future to maximize the reward.

When we look at the TD Learning method, here the rewards obtained are **analyzed after each step** and then on the basis of this only **maximum expected future reward** is calculated. Therefore, after each step, the agent decides which action should be taken to get maximum rewards.

**For example**: Again using the dog example, in this instance the dog will look for appreciation after each step: even if it merely starts running after the ball and sees the owner's appreciation, it will count on getting the reward. Similarly, if the dog is sitting and not going after the ball, the owner's scolding will help the dog understand and will make it change its action.
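The difference between the two learning targets can be sketched on a single made-up episode of rewards (Python, with illustrative values of the learning rate and discount factor):

```python
# Contrast Monte Carlo and TD(0) value updates on one fixed 3-step episode.
GAMMA = 0.9
ALPHA = 0.5
rewards = [0.0, 0.0, 1.0]          # reward observed after each of three steps
values = [0.0, 0.0, 0.0, 0.0]      # V(s) for the 4 states visited (last is terminal)

# Monte Carlo: wait until the episode ends, then update every visited
# state toward its full observed return G_t.
mc = values[:]
G = 0.0
for t in reversed(range(3)):
    G = rewards[t] + GAMMA * G
    mc[t] += ALPHA * (G - mc[t])

# TD(0): update after every single step, bootstrapping from V of the next state.
td = values[:]
for t in range(3):
    target = rewards[t] + GAMMA * td[t + 1]
    td[t] += ALPHA * (target - td[t])

print(mc)  # every visited state gets credit from the final reward
print(td)  # after one pass, only the state right before the reward moves
```

This illustrates the trade-off described above: Monte Carlo must wait for the episode to finish but propagates the reward everywhere at once, while TD learns during the episode and needs more passes for the credit to flow backwards.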

During any reinforcement learning problem the agent tries to build an optimal policy but at the same time it faces the dilemma of exploring new states while maximizing the rewards at the same time. This phenomenon faced by the agent is known as **Exploration vs. Exploitation Trade-off**.

To be precise, **Exploration** is finding more information about the environment and discovering new actions which can be taken to get more rewards, whereas **Exploitation** is exploiting known information to maximize the rewards.

**In our dog example**, let's say the owner does not scold the dog for not bringing the ball, and the dog is very lazy. Whenever the owner throws the ball, the dog will not leave its place, since it is getting a reward in the form of rest; it keeps on resting, which is analogous to **Exploitation**. But if the dog tries to bring the ball back, it discovers that it receives food as a reward. This is termed **Exploration**, since the dog explored new actions to get new rewards.

This drawback arises because the agent in most cases **memorizes** one path and will never try to explore any other paths. We want the **agent not only to keep exploiting known paths but also to keep searching for new ones**; the balance is decided by a **hyper-parameter** which suggests how much exploration and how much exploitation is needed.

We have already discussed that Reinforcement learning is the best possible option where information about a particular task/environment is limited. So now let's look at such applications:

**Playing Games like Go/Chess
**AlphaGo, as mentioned earlier, is a computer program that registered a victory against one of the best players in the world. AlphaGo used RL for deciding which move should be taken based on current inputs and actions.

**Robot Control
**Robots have learned to walk, run, dance, fly, play various sports and perform mundane tasks using RL.

**Online Advertising
**Using reinforcement learning, the broadcasting of advertisements has totally changed: users now view ads at the right time, as per their history and interests.

**Dialogue Generation
**A conversational agent chooses a sentence based on a future-looking, i.e. long-term, reward, making both speakers more involved in the conversation.

**Education and Training
**There are numerous online educational platforms which are looking to incorporate RL in their tutoring systems and personalized learning. With the use of RL, the students will have the advantage of having the study material as per their learning capability.

**Health and Medicine
**RL produces results by looking at similar problems and how they were dealt with; through this, RL suggests optimal treatment policies for patients.

**Finance
**RL has been used to perform various financial tasks like stock prediction on the basis of the past and present performance of stocks. Many companies are trying to bring Reinforcement learning into their operations by building systems for trade execution.

**Real world implementation of Reinforcement Learning
**To get a deeper insight into how reinforcement learning is implemented, have a look at the following links for the same:-

- Reinforcement learning for Stock Prediction
- Reinforcement learning for meal-planning
- Reinforcement learning for Sports Betting

The post Beginner’s Guide to Reinforcement Learning appeared first on StepUp Analytics.

The post Ridge Regression and Its Application appeared first on StepUp Analytics.

The OLS function works quite well when assumptions like a linear relationship, no autocorrelation, homoscedasticity, more observations than variables, normal distribution of the residuals, and no or little multicollinearity are fulfilled.

But in many real-life scenarios, these assumptions are violated, and we need alternative approaches. Penalized/regularized regression techniques such as ridge, lasso and elastic net regression work very well in these cases. In this article, I have tried to explain the ridge regression technique, which is a way of creating regression models when the number of predictor variables in a dataset exceeds the number of observations, or when the data suffers from multicollinearity (highly correlated independent variables).

**Regularization** methods provide a means to control our regression coefficients, which can help to reduce the variance and decrease the sampling error. Ridge regression belongs to a class of regression tools that use L2 regularization: a small addition to the OLS objective that weights the coefficients in a particular way to make the parameters more stable. The L2 penalty, which equals the square of the magnitude of the coefficients, is given by

penalty = λ × (β1² + β2² + … + βp²)

And the ridge regression objective, which is minimized over the coefficients, is given by

Σ (yi − ŷi)² + λ × Σ βj²

The amount of the penalty can be fine-tuned using a constant called lambda (λ). Selecting a good value for λ is critical. When λ = 0, the penalty term has no effect and ridge regression produces the classical least squares coefficients. As λ → ∞, the impact of the penalty grows and all coefficients are shrunk towards zero. The ideal penalty is therefore somewhere between 0 and ∞.

In this way, ridge regression puts constraints on the magnitude of the coefficients, reduces their fluctuations, and progressively shrinks them towards zero. This helps to reduce the variance of the model. The outcome is typically a model that fits the training data less well than OLS but generalizes better, because it is less sensitive to extreme variance in the data such as outliers.

Note that, in contrast to the ordinary least square regression, ridge regression is highly affected by the scale of the predictors. Therefore, it is better to standardize (i.e., scale) the predictors before applying the ridge regression so that all the predictors are on the same scale.
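As a sketch of how the penalty stabilizes coefficients under multicollinearity, the closed-form ridge estimate can be computed directly with NumPy (the toy data, with two nearly identical predictors, is made up for illustration):

```python
import numpy as np

# Ridge estimate in closed form: beta = (X'X + lambda * I)^(-1) X'y
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)    # near-duplicate of x1 -> multicollinearity
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)  # true effect of 3, carried by x1

def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_ols = ridge(X, y, 0.0)     # lambda = 0 reduces to ordinary least squares
beta_ridge = ridge(X, y, 1.0)   # the penalty shrinks and stabilizes the coefficients

print(beta_ols)    # may split unstably between the two collinear columns
print(beta_ridge)  # roughly equal coefficients, summing to about the true effect
```

With collinear columns, the OLS solution is free to trade a large positive coefficient on one column against a large negative one on the other; the λ Σ βj² penalty makes that trade expensive, so ridge spreads the effect evenly across the correlated predictors.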

**Advantages and Disadvantages Of Ridge Regression**

- Least squares regression doesn’t differentiate “important” from “less-important” predictors in a model, so it includes all of them. This leads to overfitting a model and failure to find unique solutions. **Ridge regression avoids these problems.**

- Ridge regression works in part because it doesn’t require unbiased estimators; while least squares produces unbiased estimates, their variances can be so large that they may be wholly inaccurate.

- Ridge regression adds just enough bias to make the estimates reasonably reliable approximations to true population values.

- One important advantage of the ridge regression is that it still performs well, compared to the ordinary least square method in a situation where you have a large multivariate data with the number of predictors (p) larger than the number of observations (n).

- The ridge estimator is especially good at improving the least-squares estimate when multicollinearity is present.

- Ridge regression includes all the predictors in the final model, unlike the stepwise regression methods, which will generally select models that involve a reduced set of variables.

- A ridge model does not perform feature selection. If a greater interpretation is necessary where we need to reduce the signal in our data to a smaller subset then a lasso model may be preferable.

- Ridge regression shrinks the coefficients towards zero, but it will not set any of them exactly to zero. The lasso regression is an alternative that overcomes this drawback.

Here I have given the link of a website below, where you can get the mathematical and geometric interpretation of Ridge regression **More Info**

Loading the MASS package to get the data set

**library (MASS)**

Splitting the dataset into training and testing data

**data <- Boston**

**train <- data [1:400,]**

**test <- data [401:506,]**

Loading the libraries required for ridge regression

**library (tidyverse)**

**library (glmnet)**

We need to know the **glmnet **package

- The glmnet package provides the function glmnet () for ridge regression. Rather than accepting a formula and data frame, it requires a vector input and matrix of predictors.

- We must specify alpha = 0 for ridge regression (for lasso, alpha = 1; for elastic net, 0 <= alpha <= 1).

- Ridge regression also involves tuning a hyperparameter lambda ( λ).[ Discussed earlier ]

- In the case of classification, i.e. the penalized logistic regression method, we mention **family = "binomial"**.

For more details about this package: **More Info**

There is another function **lm.ridge ()** in **MASS** package which can also be used. Please see the link below for more details about the function. **More Info** [Page Number: 79]

Preparing the training data set for training the regression model

**x.train <- model.matrix (medv~., train) [,-1]**

We save the response variable, housing price, in a vector **y.train**

**y.train <- train$medv**

We need to find the best value of lambda for the given data set with the function **cv.glmnet()**

**cv <- cv.glmnet (x.train, y.train, alpha = 0)**

Displaying the best lambda value

**cv$lambda.min**

We fit the final model on the training data by adding the best lambda value.

**model_ridge <- glmnet (x.train, y.train, alpha = 0, lambda = cv$lambda.min)**

Displaying the regression coefficients below

**coef (model_ridge)**

Preparing the test data set to be used as a data matrix and discarding the intercept for predicting the values of the response variable.

**x.test <- model.matrix (medv ~., test)[,-1]**

We save the predicted values of the response variable, housing price, in a vector **prediction_ridge**

**prediction_ridge <- predict (model_ridge, newx = x.test)**

Saving the RMSE, SSE, and MAPE values of the predictions on the test data set in **Accuracy_ridge**.
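The error-metric computations are not shown; a sketch using the standard definitions (expressing MAPE as a percentage is an assumption here):

```r
actual <- test$medv
err    <- actual - prediction_ridge

Accuracy_ridge <- c(
  RMSE = sqrt(mean(err^2)),             # root mean squared error
  SSE  = sum(err^2),                    # sum of squared errors
  MAPE = mean(abs(err / actual)) * 100  # mean absolute percentage error
)
```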

Now we fit the multiple linear regression model on the training data set

**names(train)**

**model_lm <- lm(medv ~ ., data = train)**

From the summary of the model, we can find the p-values of the individual predictor variables and decide which variables should be kept in the model.

**summary(model_lm)**

We need to check for multicollinearity with the help of the function **vif()** from the **car** package.

**vif(model_lm)**

We also need to exclude the predictor variables with high VIF values to avoid multicollinearity, though multicollinearity may be tolerated up to a certain level.

**model_lm <- lm(medv ~ crim + zn + nox + rm + dis + rad + ptratio + lstat, data = train)**

Below is the summary of the updated final model with all the significant variables, along with the VIF values of the variables. The R-squared and adjusted R-squared values are close, which also suggests that the predictor variables retained in the model are significant.

**summary(model_lm)**

**vif(model_lm)**

We compute predictions for the test data set with the multiple linear regression model that was trained on the training data set.

**prediction_lm <- predict(model_lm, test[, -14])**

We find the RMSE, SSE, and MAPE of the regression model and save them in **Accuracy_lm**.
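As with the ridge model, the computation is omitted; the same metric definitions apply (MAPE as a percentage is an assumption):

```r
err_lm <- test$medv - prediction_lm

Accuracy_lm <- c(
  RMSE = sqrt(mean(err_lm^2)),
  SSE  = sum(err_lm^2),
  MAPE = mean(abs(err_lm / test$medv)) * 100
)
```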

We save the RMSE, SSE, and MAPE values of both the linear and ridge regression models in **Accuracy**.
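One way to collect both sets of metrics into a single comparison table (a sketch, assuming the two metric vectors named above):

```r
# Stack the two metric vectors into one table for side-by-side comparison
Accuracy <- rbind(Linear = Accuracy_lm, Ridge = Accuracy_ridge)
Accuracy
```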

From the **Accuracy** values mentioned above, it is clear that even though the least squares estimates are unbiased, the accuracy of the model is compromised. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors. With other models, like lasso and elastic net regression, we have a possibility of achieving even better accuracy.

This is because complicated models tend to overfit the training data. In my next article, I will introduce you to lasso and elastic net regression and explain the comparative advantage of using these models over multiple linear or ridge regression models.

To learn more on Statistics for Data Science **Read**

The post Ridge Regression and Its Application appeared first on StepUp Analytics.

The post Supervised vs Unsupervised Machine Learning appeared first on StepUp Analytics.

So what is required for creating such machine learning systems? The following are needed:

- **Data –** Input data is required for predicting the output.
- **Algorithms –** Machine learning depends on certain statistical algorithms to determine data patterns.
- **Automation –** The ability to make systems operate automatically.
- **Iteration –** The complete process is iterative, i.e. a repetition of the process.
- **Scalability –** The capacity of the machine can be increased or decreased in size and scale.
- **Modeling –** The models are created on demand by the process of modeling.

Machine Learning methods are classified into certain categories. These are:

**Supervised Learning –**In this method, input and output are provided to the computer along with feedback during the training. The accuracy of predictions by the computer during training is also analyzed. The main goal of this training is to make computers learn how to map input to the output.

**Unsupervised Learning –** In this case, no such training is provided, leaving the computer to find the output on its own. Unsupervised learning is mostly applied to transactional data and is used in more complex tasks. It can use another iterative approach, known as deep learning, to arrive at conclusions.

**Reinforcement Learning –** This type of learning uses three components: an agent, an environment, and actions. The agent perceives its surroundings and acts in the environment; the environment is what the agent interacts with. The main goal in reinforcement learning is to find the best possible policy.
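The supervised/unsupervised distinction can be illustrated in a few lines of R, using built-in data sets (a sketch; the particular models and variable names are illustrative):

```r
# Supervised: both input (speed) and output (dist) are given to the learner
fit <- lm(dist ~ speed, data = cars)
predict(fit, data.frame(speed = 15))  # predicted stopping distance

# Unsupervised: only inputs are given; k-means discovers structure on its own
clusters <- kmeans(iris[, 1:4], centers = 3)
clusters$cluster  # group labels found without any provided outputs
```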

Machine learning makes use of processes similar to those of data mining. Machine learning algorithms are described in terms of a target function (f) that maps an input variable (x) to an output variable (y). This can be represented as:

**y = f(x)**

There is also an error term e, which is independent of the input variable x. Thus the more generalized form of the equation is:

**y = f(x) + e**

In machine learning, the mapping from x to y is learned in order to make predictions. This method, known as predictive modeling, aims to make the most accurate predictions possible. There are various assumptions about this function.
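The equation y = f(x) + e can be made concrete with a tiny simulation, where a model approximates f from noisy samples (a sketch; the linear form of f is an assumption chosen for the example):

```r
set.seed(1)
x <- runif(100)            # input variable
e <- rnorm(100, sd = 0.1)  # irreducible error, independent of x
y <- 2 * x + 1 + e         # true target function: f(x) = 2x + 1

fit <- lm(y ~ x)  # predictive model approximating f
coef(fit)         # estimates close to intercept 1 and slope 2
```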

Many modern applications depend on machine learning. Here are its main benefits:

- **Faster decision making –** Machine learning provides the best possible outcomes by prioritizing routine decision-making processes.
- **Adaptability –** Machine learning provides the ability to adapt rapidly to a changing environment, as data is constantly being updated.
- **Innovation –** Machine learning uses advanced algorithms that improve the overall decision-making capacity. This helps in developing innovative business services and models.
- **Insight –** Machine learning helps in understanding unique data patterns, based on which specific actions can be taken.
- **Business growth –** With machine learning, the overall business process and workflow become faster, contributing to overall business growth and acceleration.
- **Better outcomes –** With machine learning, the quality of the outcome improves, with a lower chance of error.
