The post What Is Classification appeared first on StepUp Analytics.
Unlike regression, where you predict a continuous number, classification predicts a category. Classification has a wide variety of applications, from medicine to marketing.
Let’s say you own a shop and you want to figure out whether one of your customers is going to visit your shop again. The answer to that question can only be a ‘Yes’ or ‘No’.
These kinds of problems in Machine Learning are known as classification problems.
Classification problems normally have a categorical output such as ‘Yes’ or ‘No’, ‘1’ or ‘0’, ‘True’ or ‘False’. Let’s go through another example:
Say you want to check whether, on a particular day, a game of cricket is possible.
In this case the weather conditions are the predictor (independent) factors, and based on them the outcome is either ‘Play’ or ‘Don’t Play’.
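As a quick illustration (not from the original post), the cricket example could be sketched with a decision tree in scikit-learn; the weather encoding and observations below are made up for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical weather observations: [outlook, humidity]
# outlook: 0 = sunny, 1 = overcast, 2 = rainy; humidity: 0 = normal, 1 = high
X = [[0, 1], [0, 0], [1, 1], [2, 1], [2, 0], [1, 0]]
y = ["Don't Play", "Play", "Play", "Don't Play", "Play", "Play"]

# Learn rules from the labelled days, then classify a new day
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
prediction = clf.predict([[1, 0]])[0]  # an overcast day with normal humidity
```

The model outputs one of the two categories, ‘Play’ or ‘Don’t Play’, for any new day it is given.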
Besides classification, there are two other types of problems in Machine Learning: Regression and Clustering.
In the image above, we have a list of the different algorithms used for each of these problems.
There are five common algorithms used to solve classification problems: Decision Trees, Naive Bayes, Random Forest, k-Nearest Neighbours, and Logistic Regression. Based on the kind of problem statement and the data at hand, we decide which classification algorithm to use.
A decision tree builds classification or regression models in the form of a tree structure. It breaks a data set down into smaller and smaller subsets while an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches, and a leaf node represents a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.
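As a hedged sketch (not from the original post), the structure described above, a root node, branching decision nodes, and leaf nodes, can be inspected with scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow tree on the iris data set
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# The printed rules show the root node (the best predictor), the branches,
# and the leaf nodes that carry the final classification
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```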
Naive Bayes is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or on the existence of the other features, all of these properties independently contribute to the probability that the fruit is an apple, which is why it is known as ‘Naive’.
The Naive Bayes model is easy to build and particularly useful for very large data sets. Along with its simplicity, Naive Bayes can outperform even highly sophisticated classification methods on some problems.
Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c):

P(c|x) = P(x|c) · P(c) / P(x)

Here, P(c|x) is the posterior probability of class c given predictor x, P(c) is the prior probability of the class, P(x|c) is the likelihood (the probability of the predictor given the class), and P(x) is the prior probability of the predictor.
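To make the apple example concrete, here is a minimal sketch using scikit-learn’s Gaussian Naive Bayes; the feature values and labels are invented for illustration:

```python
from sklearn.naive_bayes import GaussianNB

# Invented fruit data: [redness, roundness, diameter in inches]
X = [[0.9, 0.90, 3.0],
     [0.8, 0.95, 3.2],
     [0.2, 0.30, 7.0],
     [0.1, 0.40, 8.0]]
y = ["apple", "apple", "watermelon", "watermelon"]

# Each feature contributes independently to P(class | features)
model = GaussianNB().fit(X, y)
pred = model.predict([[0.85, 0.9, 3.1]])[0]  # a small, red, round fruit
```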
Random Forest is a supervised learning algorithm. As its name suggests, it creates a ‘forest’ and makes it random in some sense. The forest it builds is an ensemble of decision trees, usually trained with the ‘bagging’ method. The general idea of bagging is that a combination of learning models improves the overall result.
Put simply: Random Forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.
One big advantage of Random Forest is that it can be used for both classification and regression problems, which form the majority of current machine-learning systems. Here we talk about Random Forest for classification, since classification is sometimes considered the building block of machine learning. Below you can see what a random forest with two trees would look like:
Imagine a guy named Andrew who wants to decide which places he should visit during a one-year vacation trip. He asks people who know him for advice. First, he goes to a friend, who asks Andrew where he has travelled in the past and whether he liked it. Based on the answers, the friend gives Andrew some advice.
This is a typical decision tree approach: Andrew’s friend created rules to guide his recommendation by using Andrew’s answers.
Afterwards, Andrew asks more and more of his friends for advice, and they again ask him different questions from which they can derive recommendations. He then chooses the places that were recommended to him the most, which is the typical Random Forest approach.
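The idea of many trees voting can be sketched as follows (a minimal example on synthetic data, not from the original post):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# A synthetic two-class problem
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Build a "forest" of 10 decision trees, each trained on a bagged sample
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

n_trees = len(forest.estimators_)   # the individual decision trees ("friends")
votes = forest.predict(X[:3])       # their merged (majority-vote) prediction
```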
The k-nearest-neighbours algorithm is a supervised classification algorithm: it takes a bunch of labelled points and uses them to learn how to label other points. To label a new point, it looks at the labelled points closest to that new point (its nearest neighbours) and has those neighbours vote, so whichever label most of the neighbours have becomes the label for the new point (the ‘k’ is the number of neighbours it checks).
k-Nearest Neighbours is a lazy learning algorithm that stores all instances corresponding to the training data points in n-dimensional space. When an unknown discrete data point is received, it analyses the closest k stored instances (the nearest neighbours) and returns the most common class as the prediction; for real-valued data, it returns the mean of the k nearest neighbours.
In the distance-weighted nearest neighbour algorithm, the contribution of each of the k neighbours is weighted according to its distance from the query point, giving greater weight to the closest neighbours.
Usually, KNN is robust to noisy data, since it averages over the k nearest neighbours.
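A minimal sketch of both the plain vote and the distance-weighted variant mentioned above (the data points are invented):

```python
from sklearn.neighbors import KNeighborsClassifier

# Labelled points on a line: class 0 on the left, class 1 on the right
X = [[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]]
y = [0, 0, 0, 1, 1, 1]

# Plain vote among the k = 3 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
label = knn.predict([[1.5]])[0]

# Distance-weighted vote: closer neighbours get a larger say
knn_w = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y)
label_w = knn_w.predict([[1.5]])[0]
```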
Logistic regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes). It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables. To represent the binary/categorical outcome, we use dummy variables.
You can also think of logistic regression as a special case of linear regression when the outcome variable is categorical, where we are using log of odds as the dependent variable. In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function.
Logistic regression was developed by statistician David Cox in 1958. This binary logistic model is used to estimate the probability of a binary response based on one or more predictor (or independent) variables (features). It allows one to say that the presence of a risk factor increases the probability of a given outcome by a specific percentage.
Like all regression analyses, logistic regression is a predictive analysis. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
Applications of Logistic Regression: it is used in healthcare, the social sciences, and various machine-learning systems for advanced research and analytics.
Logistic Regression measures the relationship between the dependent variable (our label, what we want to predict) and one or more independent variables (our features) by estimating probabilities using its underlying logistic function.
These probabilities must then be transformed into binary values in order to actually make a prediction. This is the task of the logistic function, also called the sigmoid function. The sigmoid function is an S-shaped curve that can take any real-valued number and map it to a value between 0 and 1, but never exactly at those limits. These values between 0 and 1 are then transformed into either 0 or 1 using a threshold classifier.
The picture below illustrates the steps that logistic regression goes through to give you your desired output.
Below you can see what the logistic function (sigmoid function) looks like: sigmoid(x) = 1 / (1 + e^(-x)).
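A minimal sketch of the sigmoid function and the threshold classifier described above:

```python
import math

def sigmoid(x):
    """S-shaped curve mapping any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def threshold_classify(probability, threshold=0.5):
    """Turn a probability between 0 and 1 into a hard 0/1 prediction."""
    return 1 if probability >= threshold else 0
```

For example, sigmoid(0) is exactly 0.5, while large positive inputs approach, but never reach, 1.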
We want to maximize the likelihood that a random data point gets classified correctly, which is called Maximum Likelihood Estimation. Maximum Likelihood Estimation is a general approach to estimating parameters in statistical models. You can maximize the likelihood using different methods like an optimization algorithm.
Newton’s Method is such an algorithm and can be used to find the maximum (or minimum) of many different functions, including the likelihood function. Instead of Newton’s Method, you could also use Gradient Descent.
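A hedged sketch of Maximum Likelihood Estimation by gradient ascent for a one-feature logistic model (the data are invented; a real implementation would use Newton's Method or an optimized solver):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Invented data: p(y = 1 | x) should grow with x
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

# Fit p(y=1|x) = sigmoid(w*x + b) by gradient ascent on the log-likelihood
w, b, lr = 0.0, 0.0, 0.1
for _ in range(1000):
    # gradient of the log-likelihood: d/dw = sum (y - p) * x, d/db = sum (y - p)
    gw = sum((y - sigmoid(w * x + b)) * x for x, y in zip(xs, ys))
    gb = sum(y - sigmoid(w * x + b) for x, y in zip(xs, ys))
    w += lr * gw
    b += lr * gb

# After training, the fitted probabilities should separate the two groups
probs = [sigmoid(w * x + b) for x in xs]
```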
You may be asking yourself what the difference between logistic and linear regression is. Logistic regression gives you a discrete outcome, while linear regression gives a continuous outcome. A good example of a continuous outcome is a model that predicts the value of a house; that value differs based on parameters like its size or location. A discrete outcome is always one thing (you have cancer) or another (you do not have cancer).
It is a widely used technique because it is very efficient, does not require many computational resources, is highly interpretable, does not require input features to be scaled, needs little tuning, is easy to regularize, and outputs well-calibrated predicted probabilities.
Like linear regression, logistic regression works better when you remove attributes that are unrelated to the output variable, as well as attributes that are highly correlated with each other. Therefore feature engineering plays an important role in the performance of both logistic and linear regression. Another advantage of logistic regression is that it is incredibly easy to implement and very efficient to train. It is common to start with a logistic regression model as a benchmark and to try more complex algorithms from there.
Because of its simplicity and the fact that it can be implemented relatively easily and quickly, logistic regression is also a good baseline against which to measure the performance of other, more complex algorithms.
A disadvantage is that we can’t solve non-linear problems with logistic regression, since its decision surface is linear. Just take a look at the example below, which has two binary features from two classes.
It is clearly visible that we can’t draw a line that separates these two classes without a large error. A simple decision tree would be a much better choice here.
Logistic regression is also not one of the most powerful algorithms out there and can easily be outperformed by more complex ones. Another disadvantage is its high reliance on a proper presentation of your data: logistic regression is not a useful tool unless you have already identified all the important independent variables. Since its outcome is discrete, logistic regression can only predict a categorical outcome. It can also be vulnerable to overfitting when there are many features relative to observations.
As already mentioned, logistic regression separates your input into two ‘regions’ by a linear boundary, one for each class. Therefore your data should be linearly separable, like the data points in the image below:
In other words: you should think about using logistic regression when your Y variable takes on only two values (e.g. when you are facing a classification problem). Note that you can also use logistic regression for multiclass classification, which will be discussed in the next section.
Importing the Essential libraries
Importing the Dataset
Splitting the dataset into the Training set and Test set
Feature Scaling
Fitting logistic regression to the training set
Predicting the Test Set Result
Making the Confusion Matrix
Output of Confusion Matrix
Tip: the confusion matrix shows that 65 + 24 = 89 predictions are correct and 8 + 3 = 11 predictions are incorrect.
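The steps above can be sketched end-to-end. The original post’s dataset is not included here, so this hedged sketch substitutes a synthetic dataset; the exact confusion-matrix counts will therefore differ from the 89/11 split quoted above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# 1-2. "Import" a dataset (synthetic stand-in for the original CSV)
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# 3. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# 4. Feature scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# 5. Fit logistic regression to the training set
classifier = LogisticRegression(random_state=0).fit(X_train, y_train)

# 6. Predict the test-set results
y_pred = classifier.predict(X_test)

# 7. Confusion matrix: diagonal entries are correct predictions
cm = confusion_matrix(y_test, y_pred)
correct = int(cm[0, 0] + cm[1, 1])
incorrect = int(cm[0, 1] + cm[1, 0])
```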
Visualizing the Training set Results
Output of the Training Set
Visualizing the Test Set
Output of the Test Set
The post Scikit Learn Tutorial and Cheat Sheet appeared first on StepUp Analytics.
Pre-Processing

| Function | Description |
| --- | --- |
| sklearn.preprocessing.StandardScaler | Standardize features by removing the mean and scaling to unit variance |
| sklearn.preprocessing.Imputer | Imputation transformer for completing missing values |
| sklearn.preprocessing.LabelBinarizer | Binarize labels in a one-vs-all fashion |
| sklearn.preprocessing.OneHotEncoder | Encode categorical integer features using a one-hot (a.k.a. one-of-K) scheme |
| sklearn.preprocessing.PolynomialFeatures | Generate polynomial and interaction features |
Regression

| Function | Description |
| --- | --- |
| sklearn.tree.DecisionTreeRegressor | A decision tree regressor |
| sklearn.svm.SVR | Epsilon-Support Vector Regression |
| sklearn.linear_model.LinearRegression | Ordinary least squares Linear Regression |
| sklearn.linear_model.Lasso | Linear model trained with an L1 prior as regularizer (a.k.a. the Lasso) |
| sklearn.linear_model.SGDRegressor | Linear model fitted by minimizing a regularized empirical loss with SGD |
| sklearn.linear_model.ElasticNet | Linear regression with combined L1 and L2 priors as regularizer |
| sklearn.ensemble.RandomForestRegressor | A random forest regressor |
| sklearn.ensemble.GradientBoostingRegressor | Gradient Boosting for regression |
| sklearn.neural_network.MLPRegressor | Multi-layer Perceptron regressor |
Classification

| Function | Description |
| --- | --- |
| sklearn.neural_network.MLPClassifier | Multi-layer Perceptron classifier |
| sklearn.tree.DecisionTreeClassifier | A decision tree classifier |
| sklearn.svm.SVC | C-Support Vector Classification |
| sklearn.linear_model.LogisticRegression | Logistic Regression (a.k.a. logit, MaxEnt) classifier |
| sklearn.linear_model.SGDClassifier | Linear classifiers (SVM, logistic regression, etc.) with SGD training |
| sklearn.naive_bayes.GaussianNB | Gaussian Naive Bayes |
| sklearn.neighbors.KNeighborsClassifier | Classifier implementing the k-nearest neighbours vote |
| sklearn.ensemble.RandomForestClassifier | A random forest classifier |
| sklearn.ensemble.GradientBoostingClassifier | Gradient Boosting for classification |
Clustering

| Function | Description |
| --- | --- |
| sklearn.cluster.KMeans | K-Means clustering |
| sklearn.cluster.DBSCAN | Perform DBSCAN clustering from a vector array or distance matrix |
| sklearn.cluster.AgglomerativeClustering | Agglomerative clustering |
| sklearn.cluster.SpectralBiclustering | Spectral biclustering |
Dimensionality Reduction

| Function | Description |
| --- | --- |
| sklearn.decomposition.PCA | Principal component analysis (PCA) |
| sklearn.decomposition.LatentDirichletAllocation | Latent Dirichlet Allocation with online variational Bayes algorithm |
| sklearn.decomposition.SparseCoder | Sparse coding |
| sklearn.decomposition.DictionaryLearning | Dictionary learning |
Model Selection

| Function | Description |
| --- | --- |
| sklearn.model_selection.KFold | K-Folds cross-validator |
| sklearn.model_selection.StratifiedKFold | Stratified K-Folds cross-validator |
| sklearn.model_selection.TimeSeriesSplit | Time series cross-validator |
| sklearn.model_selection.train_test_split | Split arrays or matrices into random train and test subsets |
| sklearn.model_selection.GridSearchCV | Exhaustive search over specified parameter values for an estimator |
| sklearn.model_selection.cross_val_score | Evaluate a score by cross-validation |
Metrics

| Function | Description |
| --- | --- |
| sklearn.metrics.accuracy_score | Classification metric: accuracy classification score |
| sklearn.metrics.log_loss | Classification metric: log loss, a.k.a. logistic loss or cross-entropy loss |
| sklearn.metrics.roc_auc_score | Classification metric: compute the area under the ROC curve from prediction scores |
| sklearn.metrics.mean_absolute_error | Regression metric: mean absolute error regression loss |
| sklearn.metrics.r2_score | Regression metric: R^2 (coefficient of determination) regression score |
| sklearn.metrics.label_ranking_loss | Ranking metric: compute the ranking loss measure |
| sklearn.metrics.mutual_info_score | Clustering metric: mutual information between two clusterings |
Miscellaneous

| Function | Description |
| --- | --- |
| sklearn.datasets.load_boston | Load and return the Boston house-prices data set (regression) |
| sklearn.datasets.make_classification | Generate a random n-class classification problem |
| sklearn.feature_extraction.FeatureHasher | Implements feature hashing, a.k.a. the hashing trick |
| sklearn.feature_selection.SelectKBest | Select features according to the k highest scores |
| sklearn.pipeline.Pipeline | Pipeline of transforms with a final estimator |
| sklearn.semi_supervised.LabelPropagation | Label propagation classifier for semi-supervised learning |
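As a quick illustration (not part of the original cheat sheet) of how a few functions from the tables above fit together:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Chain a scaler and a classifier from the tables into one estimator
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
pipe.fit(X_train, y_train)

acc = accuracy_score(y_test, pipe.predict(X_test))  # held-out accuracy
scores = cross_val_score(pipe, X, y, cv=5)          # 5-fold cross-validation
```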
The post Implementation Of Classification Algorithms appeared first on StepUp Analytics.
The Car Evaluation Database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157, 1990). The model evaluates cars according to the following concept structure:
Input attributes are printed in lowercase. Besides the target concept (CAR), the model includes three intermediate concepts: PRICE, TECH, and COMFORT. In the original model, every concept is related to its lower-level descendants by a set of examples (for these example sets see http://www-ai.ijs.si/BlazZupan/car.html).
The Car Evaluation Database contains examples with the structural information removed, i.e., directly relates CAR to the six input attributes: buying, maint, doors, persons, lug_boot, and safety. Because of the known underlying concept structure, this database may be particularly useful for testing constructive induction and structure discovery methods.
1. Load the data
car_eval <- read.csv("Path", header = FALSE)
colnames(car_eval) <- c("buying", "maint", "doors", "persons", "lug_boot", "safety", "class")
head(car_eval)
## buying maint doors persons lug_boot safety class
## 1 vhigh vhigh 2 2 small low unacc
## 2 vhigh vhigh 2 2 small med unacc
## 3 vhigh vhigh 2 2 small high unacc
## 4 vhigh vhigh 2 2 med low unacc
## 5 vhigh vhigh 2 2 med med unacc
## 6 vhigh vhigh 2 2 med high unacc
2. Exploratory Data Analysis
summary(car_eval)
## buying maint doors persons lug_boot safety
## high :432 high :432 2 :432 2 :576 big :576 high:576
## low :432 low :432 3 :432 4 :576 med :576 low :576
## med :432 med :432 4 :432 more:576 small:576 med :576
## vhigh:432 vhigh:432 5more:432
## class
## acc : 384
## good : 69
## unacc:1210
## vgood: 65
str(car_eval)
## 'data.frame': 1728 obs. of 7 variables:
## $ buying : Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ maint : Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ doors : Factor w/ 4 levels "2","3","4","5more": 1 1 1 1 1 1 1 1 1 1 ...
## $ persons : Factor w/ 3 levels "2","4","more": 1 1 1 1 1 1 1 1 1 2 ...
## $ lug_boot: Factor w/ 3 levels "big","med","small": 3 3 3 2 2 2 1 1 1 3 ...
## $ safety : Factor w/ 3 levels "high","low","med": 2 3 1 2 3 1 2 3 1 2 ...
## $ class : Factor w/ 4 levels "acc","good","unacc",..: 3 3 3 3 3 3 3 3 3 3 ...
3. Classification Analysis
library(VGAM)
# Build the model (multinomial logistic regression)
model1 <- vglm(class ~ buying + maint + doors + persons + lug_boot + safety, family = multinomial, data = car_eval)
# Summarize the model
summary(model1)
# Predict using the model
x <- car_eval[, 1:6]
y <- car_eval[, 7]
probability <- predict(model1, x, type = "response")
car_eval$pred_log_reg <- apply(probability, 1, which.max)
car_eval$pred_log_reg[which(car_eval$pred_log_reg == "1")] <- levels(car_eval$class)[1]
car_eval$pred_log_reg[which(car_eval$pred_log_reg == "2")] <- levels(car_eval$class)[2]
car_eval$pred_log_reg[which(car_eval$pred_log_reg == "3")] <- levels(car_eval$class)[3]
car_eval$pred_log_reg[which(car_eval$pred_log_reg == "4")] <- levels(car_eval$class)[4]
# Accuracy of the model
mtab <- table(car_eval$pred_log_reg, car_eval$class)
library(caret)
confusionMatrix(mtab)