# Naive Bayes Classifier

# Introduction To Naive Bayes

Naive Bayes is built on Bayes' theorem, a result from probability theory.

Probability is the chance of an event occurring. It relates to everyday life and helps us solve a lot of real-life problems.

The probability of a single event is calculated as the proportion of cases in which that event happens. Similarly, the probability of a group of events is calculated as the proportion of cases in which all of those events occur together.

A third kind is conditional probability: if it is known that one event has already happened, what is the probability that another event happens? For example, if A is the first event and B is the second event, then P(B|A) is the probability of event B taking place given that event A has already occurred.

The two conditional probabilities are related by the equation:

P(B | A) = P(B) * P(A | B) / P(A)

where

- A is the first event
- B is the second event
- P(B|A) is the probability of event B taking place given that event A has occurred
- P(A|B) is the probability of event A taking place given that event B has occurred
- P(A) is the probability of event A taking place
- P(B) is the probability of event B taking place

This result is called Bayes' theorem. The Naive Bayes classifier is built on it.
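As a quick numeric illustration of the equation above, here is a minimal sketch in R. The three input probabilities are made-up values chosen only to show the arithmetic; they are not taken from any data set.

```r
# Hypothetical probabilities, for illustration only
p_A <- 0.40          # P(A): probability of the first event
p_B <- 0.50          # P(B): probability of the second event
p_A_given_B <- 0.60  # P(A | B): probability of A given that B has occurred

# Bayes' theorem: P(B | A) = P(B) * P(A | B) / P(A)
p_B_given_A <- p_B * p_A_given_B / p_A
p_B_given_A  # prints 0.75
```

Knowing that A has occurred raises the probability of B from 0.50 to 0.75 in this toy case.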

The algorithm assumes that the attributes of the data are independent of each other within each class. In reality they may be dependent in some way, which is why the method is called “naive”. When the independence assumption holds reasonably well, Naive Bayes can perform as well as, or better than, more complex models. It works very well when all the attributes are categorical, though it can also be used with continuous attributes. For numeric attributes it makes a further assumption: that the variable is normally distributed within each class.
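To make the independence assumption concrete, the sketch below scores one hypothetical customer by hand. All the probabilities are invented for illustration (they do not come from the data used later), and the attribute values in the comments are only examples.

```r
# Hypothetical class priors, for illustration only
prior <- c(High = 0.5, Low = 0.5)

# Hypothetical P(attribute value | class) for one customer's observed values
p_education <- c(High = 0.20, Low = 0.10)  # e.g. Education = "Master"
p_location  <- c(High = 0.30, Low = 0.40)  # e.g. Location = "Suburban"

# Naive assumption: multiply the per-attribute likelihoods as if independent
unnormalised <- prior * p_education * p_location

# Normalise so the class probabilities sum to 1
posterior <- unnormalised / sum(unnormalised)
posterior  # High 0.6, Low 0.4

# The predicted class is the one with the larger posterior
predicted <- names(which.max(posterior))
predicted  # "High"
```

The multiplication step is exactly where the “naive” assumption enters: each attribute contributes its own factor, with no term for interactions between attributes.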

**Example**

In our example, we will work on a data set with 9134 customer records. The attributes provided for each customer are “**Customer ID**”, “**State**”, “**Education**”, “**Employment Status**”, “**Gender**”, “**Location**”, “**Marital Status**”, “**Vehicle**” and “**Income**”.

The objective of the model is to predict the income level of customers. Income is divided into two levels, high and low: customers with an income below 35000 are treated as low-income customers, and those with an income of 35000 or more as high-income customers.

The steps for building the model are:

1. Import the data.

2. Clean the data, which is an important part of the process.

3. Create a derived column from the income column. The new column indicates only the income level (high or low), based on the assumption made above.

4. Divide the data in a 7:3 ratio. The first part is the training data, used to let the model learn the patterns in the data. The second part is the test data, for which the model predicts income levels.

5. View the predictions made by the model and check how accurate they are.

As outlined above, we start building the model from the first step onwards.

### Import Data

```r
setwd("C:/Users/Prithac/Desktop/step up analytics")
data <- read.csv("NB_data.csv")
```

The link to download the data set is given: **Click here**

### Data Cleaning

As mentioned above, our target is to predict the income levels of customers. So we create a column stating the income level, i.e., high or low, according to the income recorded.

If a customer has an income of 35000 or more, we place them in the “High” group; otherwise we place them in the “Low” group.

```r
data1 <- data
data1$inc <- ifelse(data$Income >= 35000, "High", "Low")
```
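A small check of how the threshold behaves at the boundary can help here. The income values below are made up for illustration; the comparison is the same `>= 35000` test used for the derived column.

```r
# Hypothetical incomes around the 35000 cut-off
income <- c(0, 34999, 35000, 35001)

# Same rule as the derived column: 35000 itself counts as "High"
inc <- ifelse(income >= 35000, "High", "Low")
inc  # "Low" "Low" "High" "High"
```

On the real data, `table(data1$inc)` gives the number of customers at each level, which is a quick way to confirm that both levels are populated.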

Now we remove the 9th variable (Income), since the income level is already captured in the derived column.

```r
data1 <- data1[, -9]
```

A few variables that are irrelevant to this model should also be removed. These are “Customer”, “Gender” and “Marital Status”, which should have no direct bearing on the income of the customers.

```r
data1 <- data1[, c(-1, -5, -7)]
```
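Dropping columns by position depends on the column order of the file. As an alternative sketch, the same removal can be done by name. The toy data frame and its column names below are assumptions for illustration; check `names(data1)` for the actual names that `read.csv` produced.

```r
# Toy data frame; column names are assumed, not read from NB_data.csv
toy <- data.frame(
  Customer = c("A1", "B2"),
  State = c("Arizona", "Oregon"),
  Gender = c("F", "M"),
  Marital.Status = c("Married", "Single"),
  inc = c("High", "Low")
)

# Keep every column whose name is not in the drop list
drop_cols <- c("Customer", "Gender", "Marital.Status")
toy_small <- toy[, !(names(toy) %in% drop_cols)]
names(toy_small)  # "State" "inc"
```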

Checking the structure of the variables,

```r
str(data1)
```

We see there are 9134 records and 6 variables structured in a data frame. The only problem is that the variable “inc” is of character type. As this variable has only two levels, High and Low, we convert it into a factor.

```r
data1$inc <- as.factor(data1$inc)
```
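Depending on the R version, `read.csv` may load the other text columns as character rather than factor (the default changed in R 4.0.0). A minimal sketch for converting every character column at once is shown below, on a toy data frame whose columns are invented for illustration. Doing this before the train/test split means both parts carry the same set of levels.

```r
# Toy data frame with character and numeric columns (illustrative only)
toy <- data.frame(
  State = c("Arizona", "Oregon", "Arizona"),
  Age = c(30, 45, 52),
  inc = c("High", "Low", "High"),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each one to a factor
is_char <- sapply(toy, is.character)
toy[is_char] <- lapply(toy[is_char], as.factor)
str(toy)
```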

Now, finally the data looks good.

### Naive Bayes Classifier Model

Installing the libraries,

```r
# Install from CRAN; repos must be a repository URL, not a link to a package zip file
install.packages("e1071", repos = "https://cran.rstudio.com")
install.packages("caret", repos = "https://cran.rstudio.com")
```

Loading them,

```r
library(e1071)
library(caret)
```

Now dividing the data set into training and testing set, keeping the ratio as 7:3,

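```r
set.seed(2)  # make the random split reproducible
# Assign each row to group 1 (training, about 70%) or group 2 (testing, about 30%)
random <- sample(2, nrow(data1), prob = c(0.7, 0.3), replace = TRUE)
data_train <- data1[random == 1, ]
data_test <- data1[random == 2, ]
```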
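Because `sample()` assigns each row independently, the split is only approximately 7:3. The sketch below repeats the same call on its own and tabulates the resulting shares; the row count 9134 is hard-coded here as a stand-in for `nrow(data1)`.

```r
set.seed(2)
n <- 9134  # stand-in for nrow(data1)

# Same assignment rule as above: 1 = training, 2 = testing
random <- sample(2, n, prob = c(0.7, 0.3), replace = TRUE)

# Share of rows that landed in each group
round(prop.table(table(random)), 3)
```

If an exact 70/30 split is needed, one option is to sample row indices without replacement instead.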

Next we run the naiveBayes() function on the training set, keeping “inc” as the dependent variable and treating the other 5 variables as independent variables (indicated by the “.” sign).

```r
data_nb <- naiveBayes(inc ~ ., data = data_train)
data_nb
```

Running “data_nb” prints a summary of the fitted model. We read it as follows.

Under the heading “A-priori probabilities”, we see the share of each income level in the training data: 49% of the training customers have low income and 51% have high income. These are the probabilities the model assigns to each level before looking at any attribute.

Under the heading “Conditional probabilities”, we get one table per variable. Each table gives the probability of every attribute value within each income level.

If the State is “Arizona”, the probability of the income being high is more than the probability of the income being low. Similarly, if the state is “California”, the probability of the income being low is more than the probability of the income being high. We read the rest in this manner.

Next, if the Education is “Bachelor”, the probability of the income being low is more than the probability of the income being high. For “Master”, by contrast, the probability of the income being high is much greater than the probability of it being low, which is logical.

We can read the other observations in the same way.
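The two parts of this output can be reproduced by hand with `table()` and `prop.table()`. The sketch below uses a tiny invented training set, not the customer data, and should correspond to what the model reports for a categorical variable when no smoothing is applied.

```r
# Toy training data, for illustration only
train <- data.frame(
  Education = c("Master", "Master", "Bachelor", "Bachelor", "Bachelor", "Master"),
  inc       = c("High",   "High",   "High",     "Low",      "Low",      "Low")
)

# A-priori probabilities: share of each income level in the training data
priors <- prop.table(table(train$inc))
priors  # High 0.5, Low 0.5

# Conditional probabilities: P(Education | inc); each row sums to 1
cond <- prop.table(table(train$inc, train$Education), margin = 1)
cond
```

In this toy table, “Master” makes up two thirds of the High row and one third of the Low row, which is the kind of comparison made in the reading above.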

Now running the model on the test data and getting the predictions,

```r
pred_nb <- predict(data_nb, data_test)
```

The variable “pred_nb” stores the predicted High and Low levels for all the records in the test set. To read it properly, let’s create a confusion matrix out of it:

```r
confusionMatrix(table(pred_nb, data_test$inc))
```

The matrix shows a very good result.

### Validation Observations

- The diagonal values are the numbers of correct predictions and the off-diagonal values are the numbers of wrong predictions. There are far fewer wrong predictions (352 + 0) than correct predictions (1382 + 1047).
- Accuracy is high (87%), which is a good indication.
- The p-value is much lower than 0.05 (<2.2e-16), which is desired. For the “P-Value [Acc > NIR]” line of the output, this means the accuracy is significantly higher than the no-information rate.
- The Kappa statistic is also high (around 75%). This indicates that the observed accuracy is well above the accuracy expected by chance.
- Sensitivity and specificity are both reasonably high. Since one of the off-diagonal counts is 0, one of the two measures is 100% and the other is lower.

Hence with all these observations we can say it is a good model.
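The accuracy and Kappa figures can be checked by hand from the four counts quoted above. In the sketch below, the placement of the counts in the matrix (which class has 1382 correct predictions, and which off-diagonal cell holds the 352 errors) is an assumption; accuracy and Kappa come out the same for any of those placements.

```r
# Counts from the text; the cell layout is assumed for illustration
cm <- matrix(c(1382, 0, 352, 1047), nrow = 2,
             dimnames = list(Prediction = c("High", "Low"),
                             Reference  = c("High", "Low")))

n <- sum(cm)

# Observed accuracy: share of predictions on the diagonal
accuracy <- sum(diag(cm)) / n

# Accuracy expected by chance, from the row and column totals
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Kappa: how far observed accuracy exceeds chance, relative to the maximum possible
kappa_stat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kappa = kappa_stat), 3)  # accuracy 0.873, kappa 0.747
```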

### Insights

- For State, customers living in “Arizona”, “Nevada”, “Oregon” and “Washington” have a higher probability of income being high than low. Those living in “California” show the opposite result. However, the difference between the two levels is small for all the States.
- For Education, customers who hold “Bachelor”, “Doctor” or “High School or Below” qualifications have a higher probability of income being low than high, while those holding a “Master” degree show the opposite. “College” customers are roughly balanced, neither clearly high nor clearly low, which is plausible.
- For Employment Status, customers who are “Disabled”, “Retired”, “Unemployed” or on “Medical Leave” have low income, and their probability of a high income is exactly 0. The opposite holds for “Employed” customers: their probability for high income is exactly 1, and that for low income is much lower. This makes sense when compared with real life.
- For Location, customers living in “Rural” and “Urban” areas have a much higher probability of income being high than low. Those living in “Suburban” areas show the opposite result, with low income more probable than high. The differences between the two levels are much larger here.
- For Vehicle, customers having a “Four-Door Car”, “Luxury Car” or “Two-Door Car” have a higher probability of income being high than low. Customers having a “Luxury SUV”, “Sports Car” or “SUV” show the opposite result. However, the difference between the two levels is small for all the Vehicles.