# Support Vector Machine

**Table of Contents**

- Introduction
- How does Support Vector Machine (SVM) work?
- Kernel trick
- Implementing SVM in Python
- Advantages and Disadvantages
- Applications

**Introduction**

Support Vector Machine (SVM) is a popular supervised machine learning algorithm used for both classification and regression, though it is mostly used for classification tasks. An SVM model is a representation of data points in space such that the points can be grouped into different categories by a clear gap between them that is as wide as possible. It looks at the extremes of the dataset, a.k.a. the support vectors (marked in the following figure), and draws a boundary known as a hyper-plane.

When data is unlabelled, supervised learning is not possible, and an unsupervised learning approach is required instead, one that attempts to find the natural clustering of the data into groups. Support vector clustering applies the statistics of support vectors to categorize unlabelled data, and is one of the most widely used algorithms in industrial applications.

**How does Support vector machine (SVM) work?**

Let’s say you have some sample points in 2D space. Now you want to classify the stars and the circles with a hyperplane.

Basically, you can classify them perfectly using any of these three planes, namely 1, 2, and 3. But is there a systematic way to choose the right plane among them? The answer is YES!

The rule of thumb is to identify the hyperplane that has the maximum distance to the nearest data points of either class. This distance is called the margin. Even if you draw any other plane parallel to hyper-plane 2 on either of its sides, its margin would be smaller than that of hyper-plane 2. So hyper-plane 2 is the correct choice.
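As a minimal sketch of this idea (the toy points below are made up for illustration), you can fit a linear SVM and inspect the support vectors and the margin width it maximizes; for a linear SVM the margin equals 2/||w||:

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated classes in 2D ("circles" vs "stars") - toy data
X = np.array([[1, 1], [2, 1], [1, 2],      # class 0
              [5, 5], [6, 5], [5, 6]])     # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear', C=1000)  # large C approximates a hard margin
clf.fit(X, y)

print(clf.support_vectors_)          # the extreme points that define the boundary
w = clf.coef_[0]
margin = 2 / np.linalg.norm(w)       # width of the maximum-margin gap
print(round(margin, 2))
```

Only the points nearest to the opposite class appear in `support_vectors_`; moving any other point (without crossing the margin) leaves the hyperplane unchanged.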

**Note:**

- SVM selects a hyperplane that classifies the objects accurately before maximizing the margin. Hyper-plane 2 classifies all objects accurately, whereas hyper-plane 1 has a classification error. Hence hyper-plane 2 is the right plane.
- SVM is robust to outliers. It ignores the outliers and finds a plane with maximum margin.

What we have seen so far is the linear support vector machine, where the clusters can be separated linearly. But what if the dataset is non-linear and you cannot separate it into different clusters using a hyperplane? Suppose you have a dataset like this. It looks impossible to separate it into two clusters with a hyperplane, especially at a reasonable computational cost.

Here you can use a kernel function to map the data points into a higher-dimensional space. You can simply apply a polynomial function to map them onto a parabola, where the data points can easily be separated using a single hyperplane, as shown in the following figure.

Hence you can convert 1D data points to 2D data points, and likewise 2D data points to 3D data points. But the computational cost of such an explicit transformation is high.
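The 1D-to-2D case can be sketched in a few lines (with made-up toy points): one class sits between the other's points on the number line, so no single threshold separates them, but mapping each point x to (x, x²) puts the classes on different heights of a parabola:

```python
import numpy as np

outer_x = np.array([-3, -2, 2, 3])   # class 1 (outer points on the line)
inner_x = np.array([-1, 0, 1])       # class 0 (inner points, in between)

# In 1D no single threshold separates the classes.
# Map to 2D: each point x becomes (x, x**2), i.e. a point on a parabola.
outer = np.column_stack([outer_x, outer_x**2])   # second coords: 9, 4, 4, 9
inner = np.column_stack([inner_x, inner_x**2])   # second coords: 1, 0, 1

# Now the horizontal line x2 = 2 is a separating hyperplane.
print(outer[:, 1].min() > 2 > inner[:, 1].max())  # True
```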

**Kernel Trick**

The kernel trick is like a magic wand: it boils complex, non-separable data points down into a simpler form while keeping the computational cost low. A kernel takes input vectors in the original space and returns the dot product of the vectors in the feature space.

You can compute the dot product between two vectors as if every point had been mapped into a higher-dimensional space by some transformation, without ever performing the mapping. It is a technique in machine learning for avoiding intensive computation in some algorithms, turning computations that would otherwise be infeasible into feasible ones.

In the following input space, the red and the blue data points are separated by a complex, computationally expensive boundary. To reduce this cost, the data is transformed into a higher-dimensional feature space (2D to 3D), where the data points can easily be separated into different clusters using a hyperplane.
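A quick numerical sketch makes the trick concrete. For the degree-2 polynomial kernel, K(a, b) = (a · b)² equals the dot product after the explicit map φ(a) = (a₁², √2·a₁a₂, a₂²), so the 3D dot product is obtained without ever leaving 2D:

```python
import numpy as np

def phi(v):
    # explicit feature map for the degree-2 polynomial kernel (2D -> 3D)
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

explicit = phi(a) @ phi(b)   # dot product in the 3D feature space
kernel = (a @ b) ** 2        # same value, computed entirely in 2D

print(np.isclose(explicit, kernel))  # True
```

For higher-degree kernels (or the RBF kernel, whose feature space is infinite-dimensional) the explicit map quickly becomes infeasible, while the kernel value stays cheap to compute.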

Some popular kernel functions are:

- Polynomial kernel
- Gaussian Radial basis function (RBF) kernel
- Gaussian kernel
- Laplace RBF kernel
- Sigmoid kernel etc.

No matter which kernel you use, it is important to tune its parameters.

**Implementing SVM in Python**

```python
# download dataset (viewed inline in a notebook)
from IPython.display import HTML
HTML('<iframe src="https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"></iframe>')
```

```python
import pandas as pd
import numpy as np
```

```python
import matplotlib.pyplot as plt
```

```python
# import dataset
from sklearn.datasets import load_iris
iris = load_iris()
```

```python
print(iris.target)
print(iris.target_names)

### Output ###
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
['setosa' 'versicolor' 'virginica']
```

Storing the features and labels in different objects

```python
X = iris.data
y = iris.target
```

Partitioning the data into training and test sets

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)
```

```python
from sklearn.svm import SVC
model = SVC()
model.fit(X_train, y_train)

### Output ###
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
```

```python
# predict
result = model.predict(X_test)
print(result)

### Output ###
[2 2 1 1 2 1 0 0 0 2 1 0 2 2 0 2 2 1 2 0 1 0 0 2 1 2 0 0 2 1 0 1 2 2 0 2 2
 2 1 1 1 1 1 1 0 2 1 0 1 2 1 1 2 0 2 0 0 0 0 0]
```

```python
# classification report and confusion matrix
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, result), '\n', classification_report(y_test, result))

### Output ###
[[20  0  0]
 [ 0 18  2]
 [ 0  1 19]]
              precision    recall  f1-score   support

          0       1.00      1.00      1.00        20
          1       0.95      0.90      0.92        20
          2       0.90      0.95      0.93        20

avg / total       0.95      0.95      0.95        60
```

Finding the best parameter values using a grid search

```python
from sklearn.model_selection import GridSearchCV

# finding the best combination of C and gamma
parameter_grid = {'C': [0.1, 1, 10, 100, 1000],
                  'gamma': [1, 0.1, 0.01, 0.001, 0.0001]}
grid = GridSearchCV(SVC(), parameter_grid, verbose=3)
grid.fit(X_train, y_train)
```

**Output**

```python
Fitting 3 folds for each of 25 candidates, totalling 75 fits
[CV] C=0.1, gamma=1 ..................................................
[CV] ......................... C=0.1, gamma=1, score=1.000000 -   0.0s
[CV] C=0.1, gamma=1 ..................................................
...
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:    0.2s finished
GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
             decision_function_shape='ovr', degree=3, gamma='auto',
             kernel='rbf', max_iter=-1, probability=False, random_state=None,
             shrinking=True, tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'C': [0.1, 1, 10, 100, 1000],
                   'gamma': [1, 0.1, 0.01, 0.001, 0.0001]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=3)
```

The C parameter controls the cost of misclassification: a large C value gives low bias and high variance (a narrow, hard margin), while a small C gives a wider, softer margin.
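To see this trade-off in practice, here is a minimal sketch on synthetic, overlapping blobs (made up for illustration, not the iris data): as C grows, the margin tightens and fewer points remain support vectors.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# synthetic, partially overlapping 2-class data (illustrative only)
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

counts = {}
for C in [0.01, 1, 100]:
    model = SVC(kernel='linear', C=C).fit(X, y)
    counts[C] = int(model.n_support_.sum())  # points on or inside the margin
    print(C, counts[C])

# small C -> soft, wide margin -> many support vectors (higher bias, lower variance)
# large C -> hard, narrow margin -> fewer support vectors (lower bias, higher variance)
```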

```python
grid.best_params_

### Output ###
{'C': 1, 'gamma': 1}
```

```python
grid.best_estimator_

### Output ###
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
```

```python
grid_predictions = grid.predict(X_test)
print(confusion_matrix(y_test, grid_predictions))
print('\n')
print(classification_report(y_test, grid_predictions))

### Output ###
[[20  0  0]
 [ 0 18  2]
 [ 0  2 18]]
              precision    recall  f1-score   support

          0       1.00      1.00      1.00        20
          1       0.90      0.90      0.90        20
          2       0.90      0.90      0.90        20

avg / total       0.93      0.93      0.93        60
```

**Advantages and Disadvantages**

**Advantages:**

- It is a robust model for solving prediction problems.
- It works effectively even if the number of features is greater than the number of samples.
- Non-Linear data can also be classified using customized hyper-planes built by using kernel trick.

**Disadvantages:**

- Choosing the right kernel is difficult sometimes.
- SVM can be extremely slow in the test phase.
- High algorithm complexity and huge memory requirement due to quadratic programming.
- When (number of samples) > (number of features), it gives poor results.

**Applications**

- **Face detection:** SVM can classify an image as face or non-face.
- **Bioinformatics:** protein classification and cancer classification.
- **Text and hypertext classification:** classifying natural-text or hypertext documents based on their content, e.g. email filtering.
- **Handwriting recognition:** SVM is used to identify handwritten characters for data entry and to validate signatures on documents.
- **Image classification**
