**H2O Package: Classification Using Logistic Regression**


Logistic regression arises from the underlying assumptions of generalized linear models (GLMs), which I will discuss in the next section. I will talk about two ways of carrying out logistic regression in R: the standard method using the **glm** function from the base package, and the h2o.glm function from the h2o package. We will also see how the accuracy improves from the first model to the second.

**What are Generalized Linear Models and how do they differ from classical linear models?**

We already know that the error term in classical linear models is assumed to follow a normal distribution. But when the response variable takes binary classes, we assume the error term follows the logistic distribution instead, with cumulative distribution function (CDF):

F(x) = 1 / (1 + e^(-x))

Hence the term logistic regression. Plotted, this CDF traces the familiar S-shaped curve, also known as the sigmoid function. Its output always lies between 0 and 1.
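As a quick illustration (not part of the original post), here is a minimal R sketch of the sigmoid; the function name is my own:

```r
# Logistic CDF (sigmoid): maps any real number into (0, 1)
sigmoid <- function(x) 1 / (1 + exp(-x))

sigmoid(0)        # 0.5, the midpoint of the S-curve
sigmoid(c(-4, 4)) # ~0.018 and ~0.982, flattening out towards 0 and 1
curve(sigmoid, from = -6, to = 6, ylab = "F(x)")  # plots the S-shaped curve
```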

For the analysis I'll be using an example dataset, following these steps:

- Reading the data
- Splitting the data into training and testing sets
- Applying glm on the training set
- Prediction using the test data
- Calculating the accuracy

The dataset considered here contains 21 variables and 3,168 observations, where the label variable indicates whether the voice of the individual considered is male or female. Before we step into the analysis, some pre-processing of the data is required. We will first subset the data, keeping only the variables that matter for our analysis, and then convert the label variable into a factor with levels 1 and 2 representing female and male respectively.

The data looks somewhat like this:

data <- read.csv("voice.csv")     # read in the data
head(data)                        # first six rows
data$label <- factor(data$label)  # convert label into a factor
str(data)                         # structure of each variable
names(data)                       # variable names

Along with label we have a set of 20 other variables, descriptive statistics of the voice samples, on which our response variable depends.

Now we partition the data into training and testing sets. For this, we use the createDataPartition function from the caret package:

library(caret)  # for createDataPartition
set.seed(99)
Train <- createDataPartition(data$label, p = 0.75, list = FALSE)
training <- data[Train, ]
testing <- data[-Train, ]

Use set.seed(99) so that the partition is replicable. In the next step, we fit the logistic regression model on the training set:

fit <- glm(label ~ Q25 + Q75 + sp.ent + sfm + mode + meanfun + minfun, data = training, family = binomial(link = "logit"))

To check what model has been formed, and to interpret the results, we use the summary function to extract all the available information.

summary(fit)

Results:

The Call section shows that we have executed the code correctly: our response variable sits before the ~ and the predictor variables after it. From the coefficient table, we can see that all of the independent variables used in the model are significant at the 10% level of significance, although we could drop the mode variable if we adopted a 5% level instead.

Apart from the table, note that the AIC value obtained from this model is 440. We will try to reduce this with a different method in the next section. But before that, let us see how well this model performs on the testing (unseen) data.

p <- predict(fit, newdata = testing, type = "response")
head(p)  # first six predicted probabilities

The row names in the output are the row indices of the original data that fell into the testing set, which is why they are not consecutive. To check how accurately the model classifies the gender label on the test data, we set a threshold of 0.5: observations with a predicted probability greater than 0.5 are classified as male, and the rest as female.

To measure the accuracy and to check for the misclassification, we form a confusion matrix.

pred1 <- ifelse(p > 0.5, 2, 1)  # 2 = male, 1 = female
tab <- table(pred1, testing$label)
tab

Thus, from the table, we can see that 379 observations were correctly classified as female and 386 as male. The accuracy is the proportion of correctly classified observations: (379 + 386) / 792 ≈ 0.966, a solid accuracy of 96.6%.
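If you prefer to compute this in code rather than by hand, a small sketch based on the confusion matrix tab above:

```r
# Accuracy: correctly classified (the diagonal) over all test observations
accuracy <- sum(diag(tab)) / sum(tab)
accuracy        # ~0.966 for the counts reported above

1 - accuracy    # the misclassification rate
```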

**Logistic Regression Using the h2o Package**

Before we proceed to the analysis, it is necessary to understand what this package is and what it does.

H2O is a leading open-source platform that makes it easy for financial services, insurance, and healthcare companies to deploy AI and deep learning to solve complex problems. To make it easier for non-engineers to create complete analytic workflows, H2O's platform includes interfaces for R, Python, and many more tools.

The steps that will be followed here are quite different from the previous case:

- Initialise the h2o cluster
- Read in the data
- Pre-process the data if required
- Convert the data into an h2o-readable format
- Split the data into training and testing sets
- Fit the model with h2o.glm on the training set
- Check the accuracy of the model on the test data

Following the above steps:

library(h2o)  # load the h2o package

h2o.init(max_mem_size = "2G", nthreads = 2, ip = "localhost", port = 54321)  # start a local h2o cluster
data <- read.csv("voice.csv")
data1 <- data[, c(4, 5, 9, 10, 11, 13, 14, 21)]  # keep the seven predictors and the label
d.hex <- as.h2o(data1, destination_frame = "d.hex")  # convert to an h2o frame

Now we have the data that can be read in by the h2o package. Converting the data might take a few seconds, depending upon the configuration of your laptop. Wait until the conversion shows 100%. To check the first few values you can use the head function.

For splitting the dataset, instead of the caret package we use h2o's built-in h2o.splitFrame function.

head(d.hex)

# h2o.splitFrame takes its own seed argument; set.seed() does not control the h2o backend
split <- h2o.splitFrame(data = d.hex, ratios = 0.75, seed = 99)
train <- split[[1]]
test <- split[[2]]

After running these functions, we can fit the glm from the h2o package on the training data and then check how accurately it classifies the categories on the test data. Again, running this code might take a few seconds.

fit3 <- h2o.glm(x = 1:7, y = 8, training_frame = train, family = "binomial", link = "logit")
h2o.performance(fit3, test)

The performance function gives a list of performance measures, such as RMSE, LogLoss, and AUC. But in this blog, I intend to concentrate only on the AIC and the confusion matrix.
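If you want individual measures rather than the full printout, h2o ships accessor functions for them; a small sketch (assuming the standard h2o R API):

```r
perf <- h2o.performance(fit3, test)  # store the test-set metrics object

h2o.auc(perf)      # area under the ROC curve
h2o.logloss(perf)  # logarithmic loss
h2o.aic(fit3)      # the AIC we compare against the first model
```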

We can clearly see that the AIC value is lower than in the first model, which indicates a better and more robust model.

We extract the confusion matrix to measure accuracy and misclassification. The beauty of this package is that it contains all the functions needed for the analysis, so you don't have to refer to different packages for different functions.
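A minimal sketch of that extraction, using h2o's own accessor:

```r
# Confusion matrix on the test data, at h2o's default threshold
h2o.confusionMatrix(fit3, test)
```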

Clearly, 390 females and 373 males are correctly classified. The accuracy is given by (390 + 373) / (390 + 373 + 13 + 8) ≈ 97%, which is slightly higher than the previous model, although there is no significant difference in accuracy between the two.

To check the prediction made for each observation in the test data, and how strong the probability behind each prediction is, we use the package's prediction function:
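Presumably this was h2o.predict; a minimal sketch:

```r
pred <- h2o.predict(fit3, test)  # predicted class plus per-class probabilities
head(pred)                       # first six predictions
```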

So the prediction made for the first observation is male, with a probability of **0.99966**, which is quite high, and so on.

And this is how you can use two different methods of carrying out logistic regression on the same dataset.



**Introduction to the Logistic Regression Model**

Do you know what type of variable is used in logistic regression? Don't worry if you don't know.

In simple linear regression there is one dependent and one independent variable; in multiple linear regression, there is more than one independent variable.

Understand one thing: if your response variable is continuous, use a linear regression model; if, on the other hand, it is categorical (e.g. positive and negative) and coded in binary form (0, 1), use logistic regression. In this model the data are coded in binary form, e.g. 1 for positive and 0 for negative (just an assumption for illustration).

**Logistic Regression:**

In statistics, **logistic regression**, or **logit regression**, or **logit model** is a regression model where the dependent variable (DV) is categorical.

Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function, which is the cumulative logistic distribution. Thus, it treats the same set of problems as probit regression using similar techniques, with the latter using a cumulative normal distribution curve instead.

Logistic regression can be seen as a special case of the generalized linear model and thus similar to linear regression. The model of logistic regression, however, is based on quite different assumptions (about the relationship between dependent and independent variables) from those of linear regression. In particular, the key differences of these two models can be seen in the following two features of logistic regression.

- First, the conditional distribution y|x is a Bernoulli distribution rather than a Gaussian distribution, because the dependent variable is binary.
- Second, the predicted values are probabilities and are therefore restricted to (0, 1) through the logistic distribution function, because logistic regression predicts the **probability** of particular outcomes.

Logistic regression is widely used in many fields, such as medicine and the social sciences.

For example, in the medical field we may want to predict whether a patient has a disease (like HIV) based on observed characteristics of the patient (age, sex, various blood tests and urine tests).

Another example: predicting the election result for some national party, or whether a voter will vote for the Congress or Democratic party, based on age, sex, income, caste, and many more characteristics.

**Example**

A group of 20 students spend between 0 and 6 hours studying for an exam. How does the number of hours spent studying affect the probability that the student will pass the exam?

The table shows the number of hours each student spent studying, and whether they passed (1) or failed (0).

| Hours | Pass |
|-------|------|
| 0.50  | 0    |
| 0.75  | 0    |
| 1.00  | 0    |
| 1.25  | 0    |
| 1.50  | 0    |
| 1.75  | 0    |
| 1.75  | 1    |
| 2.00  | 0    |
| 2.25  | 1    |
| 2.50  | 0    |
| 2.75  | 0    |
| 3.00  | 0    |
| 3.25  | 1    |
| 3.50  | 0    |
| 4.00  | 1    |
| 4.25  | 1    |
| 4.50  | 1    |
| 4.75  | 1    |
| 5.00  | 1    |
| 5.50  | 1    |

Logistic equation:

logit(p) = β0 + β1*Hours    (1)

**p = probability of the presence of the characteristic of interest**

How do we find the values of β0 and β1? We estimate them from the data, much as we estimate the coefficients in a linear regression model.

The logit transformation is defined as the logged odds:

odds = p / (1 - p)

logit(p) = ln(p / (1 - p))    (2)

For example, if p = 0.75, the odds are 0.75 / 0.25 = 3 and logit(p) = ln(3) ≈ 1.10.

From equations (1) and (2) we can recover p. Write f for the linear predictor in equation (1):

f = β0 + β1*Hours

Setting logit(p) = f in equation (2), i.e. ln(p / (1 - p)) = f, and solving for p gives:

p = 1 / (1 + exp(-f))

Now plug a student's hours of study into this equation: it returns the probability of that student passing the exam.
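To make this concrete, here is a minimal R sketch (my own addition, with my own variable names) that fits the model to the 20 observations above and returns a pass probability:

```r
# Hours studied and pass (1) / fail (0) for the 20 students in the table
hours <- c(0.5, 0.75, 1, 1.25, 1.5, 1.75, 1.75, 2, 2.25, 2.5,
           2.75, 3, 3.25, 3.5, 4, 4.25, 4.5, 4.75, 5, 5.5)
pass  <- c(0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
           0, 0, 1, 0, 1, 1, 1, 1, 1, 1)

# Fit logit(p) = b0 + b1 * hours by maximum likelihood
fit <- glm(pass ~ hours, family = binomial(link = "logit"))
coef(fit)  # estimates of b0 (intercept) and b1 (slope)

# Probability of passing after, say, 3 hours of study
predict(fit, newdata = data.frame(hours = 3), type = "response")
```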

**Note: In the next article I will teach you how to build a logistic regression model in R.**

Till then, learn the basics of logistic regression. If you have any doubts, please write them in the comment box; it's free.


Graphical representations of both linear and logistic regression will be posted very soon; till then, stay tuned. Thank you very much.


