H2O Package: Classification Using Logistic Regression
Here in this blog, we will explore
But in this
The name logistic regression comes from an underlying assumption of generalized linear models (GLMs), which I will discuss in the next section. I will cover two ways of carrying out logistic regression in R: the standard glm function that ships with base R (in the stats package), and the h2o.glm function from the h2o package. We will also compare the accuracy of the first model with that of the second.
What are Generalized linear models and how do they differ from the classical linear models?
We already know that in classical linear models the error term is assumed to follow a normal distribution. When the response variable has binary classes, that assumption no longer holds. Instead, the error term is assumed to follow the logistic distribution, given by the cumulative distribution function (CDF):
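For reference, the CDF of the standard logistic distribution (location 0 and scale 1, which is an assumption about the parameterization intended here) can be written as:

```latex
F(x) = \frac{1}{1 + e^{-x}}, \qquad x \in \mathbb{R}
```

As x grows large and negative F(x) approaches 0, as x grows large and positive it approaches 1, and F(0) = 1/2.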
Hence the term logistic regression. Plotted, the above CDF is an S-shaped curve, which is also known as the sigmoid function. The output of this function always lies between 0 and 1.
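As a quick check of the claim that the sigmoid stays between 0 and 1, here is a minimal sketch in base R. The helper name sigmoid is just an illustrative choice; plogis is base R's built-in logistic CDF and is used only to confirm that the hand-written version agrees with it.

```r
# Logistic (sigmoid) function written out by hand
sigmoid <- function(x) 1 / (1 + exp(-x))

x <- seq(-6, 6, by = 0.5)
y <- sigmoid(x)

# plogis() is base R's logistic CDF; the two should agree
all.equal(y, plogis(x))

range(y)     # both ends lie strictly between 0 and 1
sigmoid(0)   # 0.5, the midpoint of the curve

# Draw the S-shaped curve
curve(sigmoid, from = -6, to = 6, xlab = "x", ylab = "F(x)")
```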
For the analysis, I’ll use an example dataset and follow these steps:
- Reading the data
- Splitting the data into training and testing sets
- Applying glm on the training set
- Prediction using the test data
- Calculating the accuracy
The dataset considered here contains 21 variables and 3168 observations. The label variable records whether the voice of an individual is male or female. Before the analysis, some pre-processing is required: we first subset the data to keep only the variables that matter for our analysis, and then convert the label variable into a factor whose levels, female and male, are coded 1 and 2 respectively.
The data looks somewhat like this:
data <- read.csv("voice.csv")
data$label <- factor(data$label)
Along with label we have a set of 20 other variables that are the descriptive statistics upon which our response variable depends.
Now we partition the data into training and testing sets, with Train holding the row indices of the training observations:
training <- data[Train,]
testing <- data[-Train,]
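How an index like Train might be built can be sketched in base R. This is only an illustration under stated assumptions: the post mentions the caret package later on, so the original split was probably made with caret rather than with sample(); the 75/25 ratio is assumed; and the toy data frame below is a made-up stand-in for the voice data.

```r
# Toy stand-in for the voice data (hypothetical values, not the real dataset)
set.seed(99)
toy <- data.frame(
  meanfun = runif(100, 0.05, 0.25),
  label   = factor(rep(c("female", "male"), each = 50))
)

# Randomly pick 75% of the row numbers for training (the ratio is an assumption)
Train <- sample(seq_len(nrow(toy)), size = floor(0.75 * nrow(toy)))

training <- toy[Train, ]    # rows selected for training
testing  <- toy[-Train, ]   # all remaining rows

nrow(training)  # 75
nrow(testing)   # 25
```

Unlike a stratified split, a plain sample() does not guarantee the same female/male proportion in both sets.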
Call set.seed(99) before splitting so that the partition is reproducible. In the next step, we fit the logistic regression model on the training set:
fit <- glm(label ~ Q25 + Q75 + sp.ent + sfm + mode + meanfun + minfun, data = training, family = binomial(link = "logit"))
To inspect the fitted model and interpret the results, we use the summary function, which reports the model call, the coefficient table and the fit statistics.
The Call section of the output confirms that the code ran as intended, with the response variable placed before the ~ and the predictor variables after it. From the coefficient table, we can see that all of the independent variables used in the model are significant at the 10% level of significance. At the 5% level, however, the mode variable is no longer significant and could be removed.
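The same information can also be pulled out of a fitted model programmatically. The sketch below uses simulated data (not the voice dataset), and the names sim, fit_sim, coef_table and alpha are illustrative assumptions.

```r
# Simulated data for illustration only (not the voice dataset)
set.seed(99)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(0.5 + 1.2 * x1 - 0.8 * x2))
sim <- data.frame(y, x1, x2)

fit_sim <- glm(y ~ x1 + x2, data = sim, family = binomial(link = "logit"))

# Coefficient table: estimates, standard errors, z values and p-values
coef_table <- coef(summary(fit_sim))
coef_table

# Terms whose p-value is below a chosen significance level
alpha <- 0.05
rownames(coef_table)[coef_table[, "Pr(>|z|)"] < alpha]

# AIC of the fitted model (comparable across models fitted to the same data)
AIC(fit_sim)
```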
Apart from the table, note that the AIC value obtained from this model is 440. We will try to reduce this using a different method in the next section. Before that, we check how well this model performs on the testing (unseen) data.
p <- predict(fit, newdata = testing, type = "response")
head(p) #gives the first 6 observations
The serial numbers in the output are the row numbers from the original dataset; they show which observations (for example, observations 10 and 14) ended up in the testing set. To check how accurately the model classifies the gender label on the test data, we set a threshold of 0.5: observations with a predicted probability greater than 0.5 are classified as male, and the rest as female.
To measure the accuracy and to check for the misclassification, we form a confusion matrix.
pred1 <- ifelse(p>0.5,2,1)
tab <- table(pred1,testing$label)
Thus, from the table, we can see that 379 observations were correctly classified as female and 386 were correctly classified as male. The accuracy is the number of correctly classified observations (the diagonal of the table) divided by the total number of test observations, which gives a great accuracy of 96.6%.
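The accuracy calculation can be sketched on a small toy example. The two vectors below are made up for illustration and are not taken from the voice data.

```r
# Toy predicted and actual class codes (1 = female, 2 = male), for illustration only
actual <- c(1, 1, 2, 2, 2, 1, 2, 1)
pred   <- c(1, 2, 2, 2, 1, 1, 2, 1)

# Confusion matrix: rows are predictions, columns are the true labels
tab <- table(pred, actual)
tab

# Correct classifications sit on the diagonal of the table
accuracy <- sum(diag(tab)) / sum(tab)
accuracy  # 0.75
```

Taking the diagonal assumes the table is square, with both classes appearing in the predictions and in the same order as the true labels.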
Boosted Accuracy from h2o package
Before we proceed to the analysis, it is necessary to understand what this package is and what it does.
H2O is a leading open source platform that makes it easy for financial services, insurance and healthcare companies to deploy AI and deep learning to solve complex problems. To make it easier for non-engineers to create complete analytic workflows, H2O’s platform includes interfaces for R, Python and many other tools.
The steps that will be followed here are quite different from the previous case:
- Initialise the H2O package
- Read in the data
- Data pre-processing if required
- Convert data into H2O readable format
- Split the data into training and testing sets
- Apply h2o.glm on the training set
- Check for the accuracy of the model on the test data
Following the above steps:
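library(h2o)
h2o.init(max_mem_size = "2G", nthreads = 2, ip = "localhost", port = 54321)
data <- read.csv("voice.csv")
data$label <- factor(data$label)  # make sure the response is treated as categorical
data1 <- data[, c(4, 5, 9, 10, 11, 13, 14, 21)]
d.hex <- as.h2o(data1, destination_frame = "d.hex")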
Now we have the data in a format that the h2o package can read. The conversion might take a few seconds, depending on the configuration of your machine; wait until the conversion shows 100%. To check the first few values, you can use the head function.
To split the dataset, we use the built-in h2o.splitFrame function from the h2o package instead of the caret package.
split <- h2o.splitFrame(data = d.hex,ratios = 0.75)
train <- split[[1]]
test <- split[[2]]
After running these functions, we can fit the model with the glm function from the h2o package, using only the training data, and then check how accurately it classifies the test data. Again, running this code might take a few seconds.
fit3 <- h2o.glm(x = 1:7, y = 8, training_frame = train, family = "binomial", link = "logit")
The h2o.performance function returns a set of performance measures such as RMSE, log loss and AUC. In this blog, however, I intend to concentrate only on the AIC and the confusion matrix.
We can clearly see that the AIC value is lower than that of the first model, which is an indication of a better and more robust model.
We extract the confusion matrix to measure accuracy and misclassification. The beauty of this package is that it contains all the functions required for the analysis, so you don’t have to refer to different packages for different functions.
Clearly, 390 females and 373 males are correctly classified. The accuracy is given by (390+373)/(390+373+13+8) = 763/784, or about 97.3%, which is slightly greater than that of the previous model. However, there is no significant difference in accuracy between the two models.
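One rough way to check whether the gap between the two accuracies is meaningful is a two-sample test of proportions. The sketch below makes several assumptions: the size of the first test set is not stated in this post, so 792 is assumed because it is 25% of 3168 and reproduces the reported 96.6%; the two test sets are treated as independent samples; and the names correct, total and res are illustrative.

```r
# Correct classifications reported for each model
correct <- c(glm = 379 + 386, h2o = 390 + 373)

# Test-set sizes: 792 for the first model is an assumption (see above),
# the second is the sum of the four cells of the h2o confusion matrix
total <- c(glm = 792, h2o = 390 + 373 + 13 + 8)

round(correct / total, 3)  # accuracy of each model

# Two-sample test of equal proportions from base R's stats package
res <- prop.test(x = correct, n = total)
res$p.value
```

A p-value above the usual 0.05 level would be consistent with the statement that the two accuracies do not differ significantly.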
To check the prediction made for each observation in the test data, and how strong the probability behind that prediction is, we use the following function:
So the first observation is predicted to be male with a probability of 0.99966, which is quite high, and so on for the remaining observations.
And this is how you can use two different methods of carrying out logistic regression on the same dataset.