The post Neural Networks using H2o Package in R appeared first on StepUp Analytics.

An artificial neural network (or neural network for short) is a predictive model motivated by the way the brain operates. Think of the brain as a collection of neurons wired together. Each neuron looks at the outputs of the other neurons that feed into it, does a calculation, and then either fires or doesn’t.

Accordingly, artificial neural networks consist of artificial neurons, which perform similar calculations over their inputs. Neural networks can solve a wide variety of problems, such as handwriting recognition and face detection, and they are used heavily in deep learning, one of the trendiest subfields of data science. However, most neural networks are “**Black Boxes**”: inspecting their details doesn’t give you much understanding of how they solve a problem. And large neural networks can be difficult to train. For most problems you’ll encounter as a budding data scientist, they’re probably not the right choice. Graphically, a neural network looks like this:

The input layer consists of the independent variables, and the output layer contains the variable of interest, i.e. the effect to be measured. The hidden layers are user-defined and play a crucial role in the accuracy of the model. The greater the number of layers, the more complex our neural network becomes; networks with multiple hidden layers are called Deep Neural Networks. Deciding the number of layers, or the number of nodes in a layer, is quite a difficult task.

Unfortunately, there is no reliable rule to determine the number of neurons in the hidden layer. The appropriate number depends on the number of input nodes, the amount of training data, the amount of noisy data, and the complexity of the learning task, among many other factors. It is on the user to decide these parameters.

It has been proven that a neural network with at least one hidden layer of sufficient neurons is a universal function approximator. This means that neural networks can be used to approximate any continuous function to an arbitrary precision over a finite interval.

*Step by Step Analysis*

There are a couple of different ways to build neural networks in R, but the main focus of this blog will be the H2o package. We’ll be using the Concrete Data throughout our analysis, which is easily available in the UCI Machine Learning Repository. We’ll follow a sequence of steps that differs a little from the standard approach:

- Initialise the H2o package
- Read in the data
- Data pre-processing, if required
- Convert data into H2o readable format
- Split the data into training and testing sets
- Build the model on the training data
- Check the accuracy of the model on the test data

First, let’s read in the data from the CSV file and check its structure:

concrete <- read.csv("concrete.csv")

str(concrete)

So our data here has 1030 observations with 9 variables. Our variable of interest is the strength of concrete. We aim to perform a regression task, where we try to predict the strength of the concrete on the basis of the remaining variables.

Before we proceed to the analysis, we make some changes to the data to ensure robustness of the final model. The variables clearly differ widely in magnitude, so we first normalize each of them to the [0, 1] range:

normalise <- function(x){

return((x-min(x))/(max(x)-min(x)))

}

data <- as.data.frame(lapply(concrete,normalise))

str(data)

This now looks much better and cleaner. Moving on to the steps defined above, we begin our analysis using the H2o package.

install.packages("h2o")

library(h2o)

h2o.init(max_mem_size = "2G", nthreads = 2, ip = "localhost", port = 54321)

This initializes the H2o package. Next, we convert the data into a format that h2o can read. The conversion might take a few seconds, depending on the configuration of your machine; wait until it shows 100%. To check the first few values, you can use the head function.

d.hex <- as.h2o(data, destination_frame = "d.hex")

head(d.hex)

For splitting the dataset, instead of the caret package we use the built-in functions from the h2o package.

# h2o.splitFrame uses its own seed argument; set.seed() does not affect it
split <- h2o.splitFrame(data = d.hex, ratios = 0.75, seed = 99)

train <- split[[1]]

test <- split[[2]]

Now comes the main part of our analysis:

model_nn <- h2o.deeplearning(x = 1:8, y = "strength", training_frame = train, hidden = 5, model_id = "model_nn")

This code specifies that the first 8 variables are the independent variables and that “strength” is the dependent variable. We use 5 nodes in a single hidden layer and assign model_nn as the model name.

To check the performance of the model on the test data:

perf <- h2o.performance(model_nn,test)

perf

We get a few performance measures but we stick to only RMSE at this point. The next step is to make predictions and to check how these predictions match with the test data.
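If you only need the RMSE rather than the full printout, h2o exposes accessor functions for individual metrics; a small sketch:

```r
# extract just the RMSE number from the performance object
h2o.rmse(perf)
```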

pred <- as.data.frame(h2o.predict(model_nn, test))

test1 <- as.data.frame(test)

cor(pred$predict, test1$strength)

In order to find the correlation between the predicted values and the strength variable in the test data, we need to convert both into data frames first. **The correlation comes out to be 82%, which is quite good**, although it is lower than what the standard “neuralnet” method in R achieves. You can always add more nodes and layers to see how the accuracy changes, and select the model that gives a better correlation value.
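For instance, a deeper network can be specified by passing a vector to the hidden argument, one element per hidden layer (the layer sizes below are arbitrary illustrative choices, not tuned values):

```r
# two hidden layers of 10 neurons each; sizes are illustrative, not tuned
model_nn2 <- h2o.deeplearning(x = 1:8, y = "strength",
                              training_frame = train,
                              hidden = c(10, 10),
                              model_id = "model_nn2")
h2o.performance(model_nn2, test)
```

Comparing the RMSE of such variants against the single-layer model is a simple way to judge whether the extra capacity helps.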

This is how you can use the H2o package to build deep learning models in R.


The post H20 Package: Classification Using Logistic Regression appeared first on StepUp Analytics.


Logistic regression arises from the underlying assumptions of GLMs, which I will discuss in the next section. I will be talking about two ways of carrying out logistic regression in R: one is the standard method using the **glm** function from base R, and the other is the h2o.glm function from the h2o package. We will also see how the accuracy improves from the first model to the second.

**What are Generalized linear models and how do they differ from the classical linear models?**

We already know that the error term in a linear model is assumed to follow a normal distribution. But when the response variable has binary classes, we assume that the error term follows the logistic distribution instead, whose cumulative distribution function (CDF) is:
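In its standard form, that CDF is

```latex
F(x) = \frac{1}{1 + e^{-x}}, \qquad -\infty < x < \infty
```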

Hence the term logistic regression. This CDF traces an S-shaped curve, also known as the sigmoid function, whose output always lies between 0 and 1.
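You can verify the bounded output numerically with a one-line sigmoid in R (the function name is my own, for illustration):

```r
# the sigmoid (logistic) function maps any real number into (0, 1)
sigmoid <- function(x) 1 / (1 + exp(-x))
round(sigmoid(c(-5, 0, 5)), 4)  # 0.0067 0.5000 0.9933
```

Even very large negative or positive inputs are squashed into the unit interval, which is what lets the output be read as a probability.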

For the analysis I’ll be using an example dataset and the following steps will be followed:

- Reading the data
- Splitting the data into training and testing sets
- Applying glm on the training set
- Prediction using the test data
- Calculating the accuracy

The dataset considered here contains 21 variables and 3168 observations, where the label variable indicates whether the voice of the individual considered is male or female. Before we step into the analysis, some pre-processing of the data is required. We will first subset the data, keeping only those variables that are important for our analysis, and then convert the label variable into a factor variable with levels 1 and 2 representing female and male respectively.

The data looks somewhat like this:

data <- read.csv("voice.csv")

head(data)

data$label <- factor(data$label)

str(data)

names(data)

Along with label we have a set of 20 other variables that are the descriptive statistics upon which our response variable depends.

Now we partition the data into training and testing sets. For this, we use the createDataPartition function from the caret package:

library(caret)

set.seed(99)

Train <- createDataPartition(data$label, p = 0.75, list = FALSE)

training <- data[Train,]

testing <- data[-Train,]

We use set.seed(99) for reproducibility. In the next step, we fit the logistic regression model on the training data:

fit <- glm(label ~ Q25 + Q75 + sp.ent + sfm + mode + meanfun + minfun, data = training, family = binomial(link = "logit"))

To check what model has been formed and in order to interpret the results, we use the summary function to extract all the information possible.

summary(fit)

Results:

The call shows that we have executed the code correctly: our response variable is placed before the ~ and the predictor variables after it. From the table, we can see that all of the independent variables used in the model are significant at the 10% level of significance, although we could drop the mode variable if we used a 5% level.

Apart from the table, it is necessary to note that the AIC value obtained from this model is 440. We will try to reduce this using a different method in the next section. But before that, we now try to calculate how well this model performs on testing or unseen data.

p <- predict(fit, newdata = testing, type = "response")

head(p) # gives the first 6 observations

The row names simply indicate which observations of the original data ended up in the testing dataset, e.g. observation 14. In order to check how accurately the model classifies the gender label on the test data, we set a threshold of 0.5: observations with a predicted probability greater than 0.5 are classified as male, and the rest as female.

To measure the accuracy and to check for the misclassification, we form a confusion matrix.

pred1 <- ifelse(p>0.5,2,1)

tab <- table(pred1,testing$label)

tab

Thus, from the table, we can see that there are 379 observations correctly classified as female and 386 correctly classified as male. Now, in order to calculate the accuracy, we use:
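Accuracy is the share of correctly classified observations, i.e. the sum of the confusion matrix’s diagonal over its total. A minimal sketch — only the 379 and 386 correct counts are taken from the results above; the split of the misclassifications is illustrative:

```r
# confusion matrix: rows = predicted class, columns = actual class
# 379 and 386 are the correct classifications reported above;
# the 17 + 10 off-diagonal counts are an illustrative split of the errors
tab <- matrix(c(379, 17, 10, 386), nrow = 2,
              dimnames = list(predicted = c("1", "2"), actual = c("1", "2")))
accuracy <- sum(diag(tab)) / sum(tab)
round(accuracy * 100, 1)  # 96.6
```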

This gives a great accuracy of 96.6%.

Before we proceed to the analysis, it is necessary to understand what this package is and what it does.

H2o is a leading open source platform that makes it easy for financial services, insurance companies, and healthcare companies to deploy AI and deep learning to solve complex problems. To make it easier for non-engineers to create complete analytic workflows, H2O’s platform includes interfaces for R, Python and many more tools.

The steps that will be followed here are quite different from the previous case:

- Initialise the H2o package
- Read in the data
- Data pre-processing, if required
- Convert data into H2o readable format
- Split the data into training and testing sets
- Build the model on the training data
- Check the accuracy of the model on the test data

Following the above steps:

library(h2o)

h2o.init(max_mem_size = "2G", nthreads = 2, ip = "localhost", port = 54321)

data <- read.csv("voice.csv")

data1 <- data[,c(4,5,9,10,11,13,14,21)]

d.hex <- as.h2o(data1,destination_frame = “d.hex”)

Now we have the data that can be read in by the h2o package. Converting the data might take a few seconds, depending upon the configuration of your laptop. Wait until the conversion shows 100%. To check the first few values you can use the head function.

For splitting the dataset, instead of the caret package we use the built-in functions from the h2o package.

head(d.hex)

# h2o.splitFrame uses its own seed argument; set.seed() does not affect it
split <- h2o.splitFrame(data = d.hex, ratios = 0.75, seed = 99)

train <- split[[1]]

test <- split[[2]]

After running these functions, we can now run the glm function from the h2o package on the training data and then check how accurately it classifies the categories in the test data. Again, running this code might take a few seconds.

fit3 <- h2o.glm(x = 1:7, y = 8, training_frame = train, family = "binomial", link = "logit")

h2o.performance(fit3,test)

The performance function gives a list of performance measures such as RMSE, LogLoss, AUC, etc. But in this blog, I intend to concentrate only on the AIC and the confusion matrix.

We can clearly see that the AIC value is lower than in the first model, which indicates a better, more robust model.

We extract the confusion matrix to measure accuracy and misclassification. The beauty of this package is that it contains all the necessary functions required for the analysis, so you don’t have to refer to different packages for different functions.
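A sketch of pulling the confusion matrix straight from the package, using the fitted model and test frame from above:

```r
# confusion matrix of the h2o glm model, evaluated on the test frame
h2o.confusionMatrix(fit3, test)
```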

Clearly, 390 females and 373 males are correctly classified. The accuracy is given by (390 + 373) / (390 + 373 + 13 + 8) ≈ 97%, which is slightly greater than the previous model’s, although there is no significant difference in accuracy between the two models.

In order to check the prediction made for each observation in the test data, and how strong the probability behind each prediction is, we use the following function:
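Mirroring the prediction call used in the neural network section, a hedged sketch (pred2 is my own name for the result):

```r
# per-observation predicted class and class probabilities on the test data
pred2 <- as.data.frame(h2o.predict(fit3, test))
head(pred2)
```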

So the prediction made for the first observation is male, with a probability of **0.99966**, which is quite high, and so on.

And this is how you can use two different methods of carrying out logistic regression on the same dataset.

**Download** the data used in this blog. Read the latest articles on **Machine Learning**.

