The post Classical Time Series Forecasting Using Python appeared first on StepUp Analytics.

**Components of a Time Series**

**1) Trend** A trend is the general direction in which the series develops over time. A trend can be increasing or decreasing. Example: we can see an increasing trend in the population of India over time.

**2) Seasonality** The repetition of a pattern at regular intervals of time is called Seasonality.

**The general approach to a forecasting problem is:**

**1) Hypothesis Generation** This is done before taking a look at the data. We generate hypotheses based on all the prior knowledge we have about the data. This helps us form expectations about which variables will really affect the forecast and how they will affect it.

**2) Data Analysis and Hypothesis Verification** We now explore the data, analyze it and check whether our hypothesis holds true based on our analysis.

**3) Forecasting** We now have a hypothesis and we have analyzed the data. We will now move on to the forecasting using various methods.

The traditional statistical methods used for Time Series are:

**1) AutoRegressive Model (AR)** It is a time series model that uses observations from previous time steps as input to make predictions for the future. The autoregressive model specifies that the output variable depends linearly on its own previous values.
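As a sketch of this idea (not from the original article), an AR(2) model can be fitted by ordinary least squares with plain NumPy; the simulated series and its coefficients are illustrative assumptions:

```python
import numpy as np

# Simulate an AR(2) process: y_t = 0.6*y_{t-1} + 0.3*y_{t-2} + noise
rng = np.random.default_rng(0)
n = 500
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.6 * y[t - 1] + 0.3 * y[t - 2] + rng.normal(scale=0.5)

# Fit the model by regressing each value on its two previous values
lagged = np.column_stack([y[1:-1], y[:-2]])  # lags 1 and 2
coeffs, *_ = np.linalg.lstsq(lagged, y[2:], rcond=None)

# One-step-ahead forecast from the last two observations
forecast = coeffs[0] * y[-1] + coeffs[1] * y[-2]
```

The recovered coefficients land close to the true 0.6 and 0.3, which is exactly the "output depends linearly on its own previous values" idea.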

**2) Moving Average Model (MA)** Rather than using past values of the forecast variable in a regression, a moving average model uses past forecast errors in a regression-like model: y_t = c + ε_t + θ_1 ε_{t−1} + θ_2 ε_{t−2} + ⋯ + θ_q ε_{t−q}, where ε_t is white noise. We refer to this as an MA(q) model, a moving average model of order q. Of course, we do not observe the values of ε_t, so it is not really a regression in the usual sense.
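A quick NumPy sketch of an MA(1) process can make the definition concrete; the constant and θ values below are illustrative assumptions:

```python
import numpy as np

# Simulate an MA(1) process: y_t = c + eps_t + theta1 * eps_{t-1}
rng = np.random.default_rng(1)
c, theta1 = 10.0, 0.8
eps = rng.normal(size=1000)          # white noise (never observed in practice)
y = c + eps[1:] + theta1 * eps[:-1]  # each value mixes the current and previous error

# Theory: mean(y) = c and the lag-1 autocorrelation = theta1 / (1 + theta1**2)
acf1 = np.corrcoef(y[1:], y[:-1])[0, 1]
```

The sample lag-1 autocorrelation comes out near θ₁/(1 + θ₁²) ≈ 0.49, and autocorrelations beyond lag 1 vanish, which is the fingerprint of an MA(1) process.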

**3) Simple Exponential Smoothing (SES)** Simple Exponential Smoothing, also called Single Exponential Smoothing, is a time series forecasting method for univariate data without a trend or seasonality.

**4) AutoRegressive Integrated Moving Average Model (ARIMA)** In statistics and econometrics, and in particular in time series analysis, an autoregressive integrated moving average (ARIMA) model is a generalization of an autoregressive moving average (ARMA) model. Both of these models are fitted to time series data either to better understand the data or to predict future points in the series (forecasting). ARIMA models are applied in some cases where data show evidence of non-stationarity, where an initial differencing step (corresponding to the “integrated” part of the model) can be applied one or more times to eliminate the non-stationarity.

Get the code and data used in this article on GitHub.

**Now we’ll see the implementation of these models on a Time Series Data using Python**

**Naive Approach** In this forecasting technique, we assume that the next expected point is equal to the last observed point.

**Autoregressive Model** In this model, y(t) depends only on its past values y(t-1), y(t-2), etc. It works well when there is a high correlation between past and current values.

**Moving Average Model** It depends only on the past random errors, not on the past values of the series itself.

**Simple Exponential Smoothing** In this technique, we assign larger weights to more recent observations than to observations from the distant past. The weights decrease exponentially as observations come from further in the past; the smallest weights are associated with the oldest observations. If we give the entire weight to the last observed value only, this method reduces to the naive approach. So we can say that the naive approach is a special case of simple exponential smoothing in which the entire weight is given to the last observed value.
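The exponential weighting above can be written as a few lines of plain Python; the series and the smoothing parameter here are made-up illustrations:

```python
def simple_exp_smoothing(series, alpha):
    # Recursive form: new level = alpha * observation + (1 - alpha) * old level,
    # which assigns weight alpha*(1-alpha)**k to the observation k steps in the past.
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level  # the forecast for the next period

data = [10, 12, 11, 13, 12, 14]
forecast = simple_exp_smoothing(data, alpha=0.3)
naive = simple_exp_smoothing(data, alpha=1.0)  # alpha=1 keeps only the last value
```

Setting alpha to 1 reproduces the naive forecast (the last observed value, 14 here), exactly as described in the paragraph above.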

**ARIMA** It stands for AutoRegressive Integrated Moving Average.

**It is specified by three ordered parameters (p,d,q):**

- p: the order of the autoregressive model (number of time lags).
- d: the degree of differencing (number of times the data has had past values subtracted).
- q: the order of the moving average model. We will discuss these parameters in more detail in the next section.
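The d parameter can be illustrated with a tiny pandas sketch (the numbers are made up): one round of differencing removes a linear trend.

```python
import pandas as pd

# A series with a steady upward trend is non-stationary
s = pd.Series([10, 13, 16, 19, 22, 25])

# d = 1: subtract each value's predecessor; the trend disappears
diff1 = s.diff().dropna()
```

After one difference the series is a constant 3, so d=1 is enough here; a quadratic trend would need d=2.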

Read more blogs on Python for Data Science.



The post Bayesian Inference in Python appeared first on StepUp Analytics.

Bayes’ Theorem uses prior knowledge or experience to provide better results. Mathematically speaking, it uses the conditional probability of an event.

Bayes’ theorem is stated mathematically as the following equation:

**P(A/B) = P(B/A)P(A)/P(B)**

where

- A and B are events and P(B) not equal to 0.
- P(A) and P(B) are the probabilities of observing A and B independently of each other.
- P(A/B) is the conditional probability, i.e. the probability of occurrence of A given that B is true (has already occurred).
- P(B/A) is the conditional probability of B given that A is true.
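The theorem can be checked with a small worked example; the test accuracy and prevalence figures below are hypothetical numbers chosen for illustration:

```python
# Hypothetical numbers: a disease with 1% prevalence, and a test that detects it
# 99% of the time but also gives a false positive on 5% of healthy people.
p_a = 0.01               # P(A): prior probability of having the disease
p_b_given_a = 0.99       # P(B/A): probability of a positive test given the disease
p_b_given_not_a = 0.05   # probability of a positive test without the disease

# Law of total probability: P(B) = P(B/A)P(A) + P(B/not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A/B) = P(B/A)P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
```

Despite the accurate test, P(A/B) works out to only about 1/6, because the 1% prior dominates; this is the sense in which the theorem "uses prior knowledge".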

Bayesian Inference is a method of statistical inference in which we update the probability of our hypothesis (prior) H as more information (data) D becomes available, and finally arrive at our posterior probability. Bayes’ Theorem lays down the foundation of Bayesian Inference. Expressed mathematically, we have

**P(H/D) = P(H)P(D/H)/P(D)**

where

- P(H) is the probability of the hypothesis before we see the data (the prior).
- P(H/D) is what we want to compute, the probability of the hypothesis after we see the data (the posterior).
- P(D/H) is the probability of the data under the hypothesis.
- P(D) is the probability of the data under any hypothesis.

In the above coin toss example, we have the probability of heads coming up to be 0.35. We assume our prior, the probability distribution of heads coming up, to be a Uniform(0,1) distribution. Then, on the basis of the data, we arrive at our posterior distribution of heads coming up. What we notice is that our distribution gets more accurate as we increase the number of trials.

Now, assuming the prior to be a Bernoulli(0.5) distribution and combining our result with our previous prior, we have

**INFLUENCE OF THE PRIOR:**

The prior influences the result of our analysis. As is obvious from the above example, the influence of the prior is more dominant when the data volume is small. The prior’s influence eventually subsides as the number of trials becomes larger (where using frequentist inference methods might be a better option).

**HOW TO CHOOSE A PRIOR:**

This is quite a subjective question. Some do use a non-informative prior (such as Uniform(0,1) in our first example). Unless we are quite sure, it is not recommended to use strongly informative priors in our analysis.

The resultant probability distribution which summarizes both prior and the data is the posterior.

**HIGHEST POSTERIOR DENSITY (HPD) INTERVAL**

A highly useful tool to summarize the spread of the posterior density. It is defined as the shortest interval containing a given portion of probability density.
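One way to compute an HPD interval from posterior samples is to slide a fixed-mass window over the sorted samples and keep the narrowest one. This is a sketch; the Beta(20, 10) sample below is a made-up stand-in for a real posterior:

```python
import numpy as np

def hpd_interval(samples, mass=0.94):
    # Shortest interval containing `mass` of the sampled density:
    # slide a window covering `mass` of the sorted samples, keep the narrowest.
    sorted_s = np.sort(samples)
    n = len(sorted_s)
    k = int(np.floor(mass * n))             # samples inside each candidate window
    widths = sorted_s[k:] - sorted_s[: n - k]
    i = int(np.argmin(widths))              # the shortest window wins
    return sorted_s[i], sorted_s[i + k]

rng = np.random.default_rng(2)
posterior_samples = rng.beta(20, 10, size=5000)  # posterior-like sample
lo, hi = hpd_interval(posterior_samples, mass=0.94)
```

For a skewed posterior, this interval is shorter than the equal-tailed interval of the same mass, which is why HPD is the preferred summary of spread.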

There are several ways to compute the posterior computationally, which can be broadly classified as

- Non-Markovian Methods
- Markovian Methods

**NON-MARKOVIAN METHODS**

GRID COMPUTING: It is a brute force approach mainly used when we cannot compute the posterior analytically. For a single parameter model, the grid approach is as follows:

- Define a reasonable interval for the parameter (the prior should give you a hint).
- Place a grid of points (generally equidistant) on that interval.
- For each point in the grid, we multiply the likelihood and the prior.
- Normalize the computed values (divide the result at each point by sum of all points).
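The four steps above can be sketched for the coin toss example with NumPy (the 35-heads-in-100-tosses data mirrors the earlier example; the grid size is an arbitrary choice):

```python
import numpy as np

# Grid approximation of the posterior for a coin's heads probability, theta.
# Data: 35 heads in 100 tosses; prior: Uniform(0, 1).
heads, tosses = 35, 100

grid = np.linspace(0, 1, 201)        # steps 1-2: equidistant grid on [0, 1]
prior = np.ones_like(grid)           # uniform prior
likelihood = grid**heads * (1 - grid)**(tosses - heads)
unnormalized = likelihood * prior    # step 3: multiply likelihood and prior
posterior = unnormalized / unnormalized.sum()  # step 4: normalize

theta_map = grid[np.argmax(posterior)]  # posterior mode
```

With a uniform prior, the posterior mode lands on 0.35, the observed proportion of heads, and the normalized grid sums to 1 by construction.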

The grid computing approach does not scale well to high-dimensional data.

There are other non-Markovian methods as well, such as the QUADRATIC (LAPLACE) METHOD and VARIATIONAL METHODS.

**MARKOVIAN METHODS:**

There’s a family of methods known as MCMC (Markov Chain Monte Carlo) methods. Here as well, we need to compute the prior and the likelihood at each point to approximate the whole posterior distribution. MCMC methods outperform the grid approximation because they are designed to spend more time in higher-probability regions than in lower ones.

**Revisiting the Coin Toss Example Using the PyMC3 Library (a Python library for probabilistic programming)**

Get more articles on Python for Data Science.

Download the IPython notebook used in the above example from GitHub.


The post Data Science: Python for Data Preprocessing appeared first on StepUp Analytics.

Similarly, we will be preprocessing the data by cleaning it, removing insignificant features and then performing data exploration. These steps comprise data preprocessing/data wrangling, which is mandatory before data visualization and feature engineering.

Continuing with the first part of this series, we will be looking at different techniques involved in the preprocessing of data. This series of articles will be covering the following topics:-

- Web Scraping
- Data Preprocessing
- Feature Engineering and Model Selection

We will look into the following topics which are covered under preprocessing of raw data:-

- **Data Preprocessing**
  - **Data Exploration**
  - **Identifying variables**
- **Data Cleaning**
  - **Dropping Features**
  - **Missing Values Identification**
  - **Formatting the data**
- **Data Visualization / Exploratory Data Analysis**
- **Text data preprocessing**

Data preprocessing involves a collection of steps which help to purify the data: extracting the useful information and removing the insignificant. Data obtained from the real world is incomplete and inconsistent, and it also contains numerous errors. Thus, to counter these issues, we use data preprocessing, which helps remove discrepancies in names and related problems.

In the real world, we generally have two broad categories of data. In the first kind of data, we have continuous and categorical features and then in the second kind of data we have the text data. So these two types of data require different steps for preprocessing. So first let’s have a look at the steps for continuous and categorical features.

To understand the following steps of data handling and visualization, we will be using two different datasets which are available on Kaggle. The first dataset contains information about the cities of India and the other dataset is related to online shopping.

Initially, after loading the dataset, we have to study it to gather the general insights it possesses. To gather these insights, we look at the continuous features. For this we will use the **head()**, **info()** and **describe()** functions of pandas.

The **head()** function helps us to know the columns and rows contained in the dataset. By default, the **head()** function displays the first 5 rows of the dataset.

After looking at some initial values, we can determine the shape i.e. number of columns and rows contained in the dataset by using a **shape** attribute of the dataframe.

The **info()** function tells us the count of rows which contain values in each of the columns. By comparing the row count displayed through the shape attribute with the **info()** output, we can tell whether there are any cells with **NaN** values. In the above example, we have no empty cells, as all 493 rows contain values. Here we can also see the data type of each column of the dataset.

We can look at the variables/columns of the data which can act as predictor variables and target variables, and so build a connection between them. In the mentioned example, state_name and name_of_city are the predictor variables, whereas the other columns can be the target variables.

Moreover, we can also remove the columns which are of no significance and cannot provide any sort of insight into the dataset. For example, in this dataset the **state_code**, **dist_code** and **location** columns are not of any use, so we will remove them from our dataframe so that they do not hinder the results of other operations.

The **drop()** function removes the columns from the dataframe, and the **axis = 1** parameter specifies that we want to remove columns rather than rows.
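A minimal sketch of this step, using a made-up miniature of the cities dataframe (the rows and codes are illustrative):

```python
import pandas as pd

# A hypothetical miniature of the cities dataframe
df = pd.DataFrame({"name_of_city": ["Agra", "Pune"],
                   "state_name": ["Uttar Pradesh", "Maharashtra"],
                   "state_code": [9, 27],
                   "dist_code": [15, 25]})

# axis=1 tells drop() to remove columns rather than rows
df = df.drop(["state_code", "dist_code"], axis=1)
```

Only the meaningful columns survive; passing `inplace=True` instead of reassigning would modify the dataframe in place.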

In univariate analysis, we analyze variables based on their type, i.e. whether they are continuous or categorical. For continuous variables, which is the case with the dataset we have, we can use the **describe()** function. Through this function, we can discern the different measures which reveal the central tendency and spread of each variable.

Here the values which **describe** function displays are count, mean, standard deviation, minimum and maximum value along with the three quartile values. For categorical values, we can use a frequency table for a better understanding of different categories.

As the column ‘**state_name**’ is categorical, we can find the frequency of different states in the dataset. This is done by using **value_counts()** function.

Now after having gone through the data, we will be looking to clean the dataset by removing erroneous values, redundant data. Also, we would want to remove the columns with missing values. Moreover, it is suggested to nullify the impact of outliers as well.

For performing data cleaning, we would be using a different dataset which is related to Online Shopping.

When it comes to data cleaning, we focus onto three main steps which are as follows:-

As we saw earlier, we removed three columns from the dataset as they were of no significance. Similarly, if there are more columns with a large number of missing values, it is recommended to drop those columns entirely. Along with this, at this stage we perform the changes which are necessary to make the data apt for the preprocessing steps.

As we can see, the column names start with capital letters; we want them in lowercase so that they do not cause problems later.

By using **rename()** function, we will be able to rename the column names as per our requirement.

We can clearly see that the column names have been changed into lowercase.

Managing the missing values in the dataset plays a crucial part in data preprocessing. If we do not handle the missing values, then we can get misleading results. First, for checking missing values, we can use the following code snippet.

In the above code snippet, we look at null values and sort them in ascending order. We can see that the cust_id column has the highest number of missing values.

In this code snippet, we look at the rows with missing values. Thus after knowing this, we can decide how we will handle these missing values. The different ways of managing missing values are as follows:-

- We can fill these missing values with random values like ‘0’
- Ignoring missing values, if they are less in number
- Another way is to fill these missing rows with mean, mode or median of the columns. This method is more preferred than the other two methods.
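The three strategies above can be sketched with a hypothetical slice of the shopping data (the customer IDs and amounts are made up):

```python
import numpy as np
import pandas as pd

# A hypothetical slice of the shopping data with gaps
df = pd.DataFrame({"cust_id": [101.0, np.nan, 103.0, np.nan],
                   "amount": [250.0, 310.0, np.nan, 120.0]})

missing_counts = df.isnull().sum().sort_values()     # missing values per column

filled = df.fillna({"amount": df["amount"].mean()})  # impute with the column mean
dropped = df.dropna()                                # or drop incomplete rows
```

Mean imputation preserves all four rows, while `dropna()` keeps only the single fully-populated row; which trade-off is right depends on how much data you can afford to lose.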

By using the **dropna()** function, we can remove the rows with empty values and then store the result in a new dataframe.

When we check for the missing values, we can see that there are no missing values.

Formatting of data involves making the data types compatible with other data types of the columns, removing abbreviations and also creating new columns by using values from existing columns. In this example, we will be formatting the following things.

If we look in the above example of this dataset, the description is present in the upper case. Thus to change it, we will be using the **lower()** function to make the description to lower case.

In the above snippet, we can see cust_id column is of float type and to change this, we will be using the code shown above.

Here in this picture, the cust_id data type is now of integer type. Therefore, this shows how to change the data type as per our needs.
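Both formatting fixes can be sketched in a few lines; the rows below are hypothetical stand-ins for the dataset's description and cust_id columns:

```python
import pandas as pd

# Hypothetical rows mirroring the description and cust_id columns
df = pd.DataFrame({"cust_id": [12346.0, 12347.0],
                   "description": ["WHITE MUG", "RED LAMP"]})

df["description"] = df["description"].str.lower()  # upper case -> lower case
df["cust_id"] = df["cust_id"].astype(int)          # float -> integer
```

Note that `astype(int)` fails if the column still contains NaN values, which is one reason missing values are handled before formatting.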

After completing the preprocessing of data, the next step is to perform the visualization of data. This is also known as Exploratory Data Analysis. We will use both the datasets for visualization and getting insights from them. First, let’s look at some visualizations from the cities dataset.

This code snippet displays the top 5 cities with the highest population, with all the column values.

For obtaining the states with the highest population, we will be using the **groupby()** function and then plotting the result.

It is clearly discernible from this plot that states like Uttar Pradesh and Maharashtra have the highest population, whereas states like Meghalaya, Mizoram, and Nagaland have the lowest population.

Now let’s look at some of the visualizations of the other dataset.

This displays the highest money spending customers from different countries.

The above visualization helps us to know the number of orders on different days of the week. There can be more visualizations as well. It depends on our creativity and curiosity about what we want to know from the dataset.

Most of the data obtained from websites and other sources is text data, and thus it needs to be processed in a different manner. I have covered text data preprocessing in a separate article on Natural Language Processing.

You can have a look at that article to learn the steps required for text data preprocessing. Generally, the steps involved in preprocessing text data are as follows:-

**Converting text to lowercase:** This is recommended because it removes anomalies between identical words which appear in both upper and lower case.

**Removing numbers:** It is a tedious task to process text data with numbers, so we remove them.

**Removing punctuation and special characters:** Punctuation and special characters provide no information, so we always drop them from the text.

**Removing stop words:** Stop words are articles and prepositions like a, an, the, etc., which are abundant in the data but of no use, so they are also removed.

After the above steps, then we have options of some specific steps which can be taken as per our requirements. The Jupyter notebooks for this article can be referred from here.


The post Install Python on Windows and Mac (Anaconda) appeared first on StepUp Analytics.

**1.** Download and install Anaconda (Windows version) from the Download Link.

Choose either the Python 2 or Python 3 version depending on your needs. It doesn’t affect the installation process.

**2.** Select the default options when prompted during the installation of Anaconda.

Note: If you checked this box, steps 4 and 5 are not needed. The reason it isn’t preselected is that a lot of people don’t have administrative rights on their computers.

**3.** After you finish installing, open **Anaconda Prompt**. Type the command below to confirm that you can use a Jupyter (IPython) Notebook.

If you want a basic tutorial on opening Jupyter and using Python, please see the videos below.

Windows OS: YouTube Video

Mac OS: YouTube Video

**4.** If you didn’t check the add Anaconda to path argument during the installation process, you will have to add python and conda to your environment variables. You know you need to do so if you open a **command prompt** (not anaconda prompt) and get the following messages

**5.** This step gives two options for adding python and conda to your path (only choose 1 option).

If you don’t know where your conda and/or python is, you type the following commands into your **anaconda prompt**

Happy Learning!


The post Web Scraping Tutorial Using Python – Part 1 appeared first on StepUp Analytics.

The above analogy applies to the ubiquitous data too. Most of the time we can get data from sources like Kaggle, but there are scenarios where we need customized data. For this, we have to choose the path of web scraping, i.e. **getting the data from websites**, using either the APIs provided or Python and its libraries.

Once done with the step of getting the data, we will need to clean and handle it, making it appropriate for information extraction. Finally, we will be infusing flavors into it, i.e. engineering the features of the data for information extraction.

In this series of articles on Web Scraping using Python, we will be covering all the data preparation steps, which are as follows:

- Data Collection (**Web Scraping**)
- Data Cleaning
- Data Handling and Feature Extraction

In this first article, we will be learning about Web Scraping. The points covered in this article are given below:-

- What is Web Scraping?
- Different ways opted for Web Scraping.
- Libraries used for Web Scraping.
- Practical Implementation of Web Scraping
  - Through BeautifulSoup
  - Through Scrapy

**What is Web Scraping?**

Web scraping is a way to extract information from web pages, where it is present in HTML format. Because this data is unstructured, web scraping helps us both to retrieve it and to convert it into a structured format.

**Different ways opted for Web Scraping.**

There are numerous ways through which we can scrape the web. Some of them are as follows:-

**Using APIs:** Many websites/organizations provide an API for extracting data from their website, but there are a lot of limitations to the kind of data which can be extracted this way.

**Through the in-built libraries/frameworks of Python:** Python is home to many libraries for distinguished tasks, and web scraping can also be achieved using those libraries. Moreover, there are frameworks as well which facilitate this process.

It’s time to have a look at the libraries which are used for web scraping.

- **BeautifulSoup**: This library is used to extract information from a webpage which is present in various tags like table, paragraph, etc.
- **Urllib**: Using this library, we fetch webpages through their URLs.
- **Requests**: This is another library for downloading webpages, given their URLs.
- **Scrapy**: Scrapy is a collaborative and open-source Python framework which is used for large-scale web scraping.

**Web scraping through Beautiful Soup.** Here we will be scraping the web through the Beautiful Soup library. For scraping purposes, we are using a weather forecast website: we will be scraping the weather forecast data of **San Francisco**. So let’s start this journey!

First, we will be importing BeautifulSoup library as bs4 and requests library which is used for extracting URL.

Using the requests library, we download the desired web page, whose URL encodes the latitude and longitude of the city, i.e. San Francisco.

After downloading the page, we will be parsing the HTML content using BeautifulSoup.

For getting the forecast from the web page, we have to **inspect** the webpage, identify the value of the relevant ‘id’ attribute, and assign it to a variable.

Our next task is to select the elements by their class attribute within that **‘id’** element, assigning the result to the forecast_items variable. Here the **find_all()** function gets all the matching elements on the page.

We are printing the HTML code and then using the **prettify()** function to get it in a structured form.

The next task is to identify the class attributes for extracting more information. The three variables used are **period**, **short description** and **temperature**.

For obtaining the title of the forecasts, we are using the ‘title’ attribute of the ‘img’ tag. After obtaining the data, we will be using **prettify()** to structure it and then printing the title.

We have to iterate over period tags for getting the period names of further days. Here we have used list comprehension for this.

Furthermore, using the ‘tombstone-container’ we are extracting the short description, temperature, and description of different days of forecast.

For better and clear representation of data, we will be mapping the values to a dataframe.

Therefore, this dataframe is the final result which is obtained through this web scraping.
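The whole flow above can be sketched on a tiny inline snippet instead of a live page; the id and class names mirror the ones discussed above, but the HTML and its values are assumptions made for illustration:

```python
from bs4 import BeautifulSoup

# A tiny inline snippet standing in for the downloaded forecast page
html = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p class="short-desc">Partly Cloudy</p>
    <p class="temp temp-low">Low: 52 F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Friday</p>
    <p class="short-desc">Sunny</p>
    <p class="temp temp-high">High: 67 F</p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
seven_day = soup.find(id="seven-day-forecast")          # select by id
forecast_items = seven_day.find_all(class_="tombstone-container")  # select by class

# List comprehensions over the matched tags, as described above
periods = [item.find(class_="period-name").get_text() for item in forecast_items]
temps = [item.find(class_="temp").get_text() for item in forecast_items]
```

On the real site the same `find`/`find_all` calls run against the downloaded page content, and the resulting lists are what get mapped into the dataframe.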

Coming to another way of scraping, we will be using the Scrapy framework. Scrapy defines classes called Spiders that specify how a website will be scraped: the starting URLs and what to do on each crawled page.

Using Scrapy, we will be scraping the URLs of images of headphones from amazon.com and writing them to a text file.

**For installing Scrapy:** `pip install scrapy`

Let’s start scraping with scrapy.

First, we have to open a command prompt and then create a new project by using the ‘startproject’ command.

Here the ‘startproject’ command creates a new folder named ‘headphones’. This folder will contain 4 pre-created files, **items.py, settings.py, middleware.py** and **pipelines.py**, which are required for creating a spider. These files can be customized if required.

Now we will create a ‘spider’ with the ‘genspider’ command, where we specify the name of the spider and the domain to crawl, *https://amazon.com*

**NOTE: **Keep in mind that the project name and spider name should be different.

**Output:**

We are starting with the most basic scraper class, which inherits from the Spider class provided by Scrapy. Here we set the name of the spider and then define the **__init__()** function.

In the **start_requests()** function, we specify the URLs which are to be crawled. We then iterate over each URL and yield requests using Scrapy’s **Request()** function.

In this **parse()** function, we are extracting the URLs of the images from the page.

Continuing this **parse()** function, we use a try and except block. The try part gets the next-page link, which is present in a ‘span’ tag, and yields a request that follows it.

Lastly, in the except block, we check whether there are any more links available. If not, we create a text file containing the URLs of the images, converting each URL into a string. The code for this article can be found here.


The post Workshop On Data Science Using Python And Advance Excel appeared first on StepUp Analytics.

A good Data Scientist is a passionate coder and statistician, and there’s no better programming language for a statistician to learn than Python. It’s a popular skill among Big Data analysts and Data Scientists, and one of the most sought after by some of the biggest brands. In addition, Python’s commercial applications increase by the minute, and companies appreciate its versatility.

Python is open-source and freely available. Unlike SAS or Matlab, you can freely install, use, update, clone, modify, redistribute and resell Python.

- Python is a powerful scripting language. As such, it can handle large, complex data sets. It is also well suited to heavy, resource-intensive simulations and can be used on high-performance computer clusters.
- Python is cross-platform compatible. It can be run on Windows, Mac OS X, and Linux, and it can import data from Microsoft Excel, Microsoft Access, MySQL, SQLite, Oracle, and other programs.

The objectives of this workshop are:

- To help the participants familiarize themselves with the Python programming language and data analysis.
- To train the participants in accessing data from various domains.
- To develop confidence as an independent Python programmer and data analyst.

**I. Faculty of science**

- Department of statistics
- Department of mathematics
- Masters in computer application (MCA)
- Department of remote sensing.

**II. Faculty of arts**

- Department of economics
- Department of psychology

**III. Faculty of engineering**

**IV. Faculty of management**

**V. Anyone who wants to learn and use python or wants to up-skill them self are welcome to this workshop**

**Name:** Muquayyar Ahmed
**Education:** BTech (Electronics and Communication)
**Profile Description:** Ahmed has 5 years of experience in Data Science. He has worked with different tools and technologies including text mining, data mining, image mining, R and Python. He has executed multiple projects in domains including Healthcare, Pharma, BFSI and Telecom. He is currently working at TCS as a Senior Data Scientist.

**UG Department: Click**

**AMU PG & Engineering Department: Click**

**Research Scholar:** **Click**

**Faculty Members: Click**



The post Data Science Solution Using Titanic Dataset appeared first on StepUp Analytics.

The sinking of the Titanic is one of the most infamous shipwrecks in history. On 15 April 1912, during its maiden voyage, the Titanic sank after colliding with an iceberg. About 1,500 people were killed in the disaster. One of the reasons the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

The basic workflow to solve any data science problem is as follows :

- Identifying the problem
- Acquire test and training data
- Clean the data
- Analyze the data
- Model, predict and solve the problem
- Visualize the data and come up with a solution

But here our goal is to get a generalized prediction as fast as possible. But this doesn’t mean to avoid the exploratory data analysis (EDA).

Before you begin, I recommend that you read about the Random Forest algorithm first, as in this tutorial we are going to use random forest only. But you can also implement the solution using other algorithms such as logistic regression or decision trees.

```python
# An efficient data structure
import pandas as pd

# Importing the data
X = pd.read_csv("C:/Users/DELL/Downloads/train.csv")
X.describe()
```

```python
y = X.pop("Survived")
X.head()
```

```python
# Impute missing Age values with the mean
X["Age"].fillna(X.Age.mean(), inplace=True)
X.describe()
```

```python
# Selecting only the numeric variables
numeric_variables = list(X.dtypes[X.dtypes != "object"].index)
X[numeric_variables].head()
```

```python
# Importing the random forest model: RandomForestRegressor is used here
# (not the classifier) so that oob_prediction_ gives continuous scores
# we can later feed to roc_auc_score
from sklearn.ensemble import RandomForestRegressor

# Instantiate with parameters
model = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=42)

# Fit the model
model.fit(X[numeric_variables], y)
```

```
########## Output ##########
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
       max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0,
       min_impurity_split=None, min_samples_leaf=1, min_samples_split=2,
       min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
       oob_score=True, random_state=42, verbose=0, warm_start=False)
```

A higher value of n_estimators means more trees in the forest, which generally improves predictions at the cost of longer training time.

```python
# Prediction error: oob_score_ gives the R^2 value based on OOB predictions
model.oob_score_
```

```
#### Output ####
0.1361695005913669
```

```python
# Finding the C-stat (AUC) value
from sklearn.metrics import roc_auc_score

y_oob = model.oob_prediction_
print("C-stat", roc_auc_score(y, y_oob))
```

```
#### Output ####
C-stat 0.7399551550399983
```

Now we have a benchmark that can be improved further.

```python
# Function to show stats on categorical variables
def describe_categorical(X):
    from IPython.display import display, HTML
    display(HTML(X[X.columns[X.dtypes == "object"]].describe().to_html()))
```

describe_categorical(X)

```python
# Dropping variables which are not relevant
X.drop(["Name", "Ticket", "PassengerId"], axis=1, inplace=True)
```

```python
categorical_variables = ["Sex", "Cabin", "Embarked"]

for variable in categorical_variables:
    # Fill missing data with the label "Missing"
    X[variable].fillna("Missing", inplace=True)
    # Create an array of dummies
    dummies = pd.get_dummies(X[variable], prefix=variable)
    X = pd.concat([X, dummies], axis=1)
    # Drop the original variable
    X.drop([variable], axis=1, inplace=True)
```

```python
# Print all columns
def printall(X, max_rows=10):
    from IPython.display import display, HTML
    display(HTML(X.to_html(max_rows=max_rows)))

printall(X)
```

```python
model = RandomForestRegressor(100, oob_score=True, n_jobs=1, random_state=42)
model.fit(X, y)
print("C-stat : ", roc_auc_score(y, model.oob_prediction_))
```

```
#### Output ####
C-stat :  0.8641256298000618
```

```python
feature_importances = pd.Series(model.feature_importances_, index=X.columns)
# feature_importances = feature_importances.sort_values()  # optional: sort before plotting
feature_importances.plot(kind="bar", figsize=(40, 10));
```

Parameters to improve the model:

- n_estimators: number of trees in the forest
- max_features: number of features considered when looking for the best split
- min_samples_leaf: minimum number of samples in newly created leaves
- n_jobs: number of processors that can be used to train and test the model

```python
## n_estimators: finding the optimal number of trees
results = []
n_estimator_values = [10, 20, 50, 100, 150, 200]

for trees in n_estimator_values:
    model = RandomForestRegressor(trees, oob_score=True, n_jobs=-1, random_state=42)
    model.fit(X, y)
    print(trees, "trees")
    roc_score = roc_auc_score(y, model.oob_prediction_)
    print("C-stat : ", roc_score)
    results.append(roc_score)
    print(" ")

pd.Series(results, n_estimator_values).plot();
```

```
10 trees
C-stat :  0.8274933691240853
20 trees
C-stat :  0.8562218387498801
50 trees
C-stat :  0.8620644659615038
100 trees
C-stat :  0.8641256298000618
150 trees
C-stat :  0.8635770513107297
200 trees
C-stat :  0.8650230616005709
```

```python
%%timeit
# Finding the best execution time: all available processors (n_jobs=-1)
model = RandomForestRegressor(200, oob_score=True, n_jobs=-1, random_state=42)
model.fit(X, y)
```

661 ms ± 87.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

```python
%%timeit
# Single processor (n_jobs=1) for comparison
model = RandomForestRegressor(200, oob_score=True, n_jobs=1, random_state=42)
model.fit(X, y)
```

920 ms ± 188 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

```python
# Finding the optimal value for max_features
results = []
max_features_values = ["auto", "sqrt", "log2", None, 0.2, 0.9]

for max_features in max_features_values:
    model = RandomForestRegressor(n_estimators=200, oob_score=True, n_jobs=-1,
                                  random_state=42, max_features=max_features)
    model.fit(X, y)
    print(max_features, "option")
    roc_score = roc_auc_score(y, model.oob_prediction_)
    print("C-stat : ", roc_score)
    results.append(roc_score)
    print(" ")

pd.Series(results, max_features_values).plot(kind="barh", xlim=(0.86, 0.9));
```

```
auto option
C-stat :  0.8650230616005709
sqrt option
C-stat :  0.8665516249640495
log2 option
C-stat :  0.8673851447075491
None option
C-stat :  0.8650230616005709
0.2 option
C-stat :  0.8639365566314086
0.9 option
C-stat :  0.8643573110067215
```

```python
# Finding the optimal value for min_samples_leaf
results = []
min_sample_leaf_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

for min_sample in min_sample_leaf_values:
    model = RandomForestRegressor(n_estimators=200, oob_score=True, n_jobs=-1,
                                  random_state=42, max_features="log2",
                                  min_samples_leaf=min_sample)
    model.fit(X, y)
    print(min_sample, "min sample")
    roc_score = roc_auc_score(y, model.oob_prediction_)
    print("C-stat : ", roc_score)
    results.append(roc_score)
    print(" ")

pd.Series(results, min_sample_leaf_values).plot();
```

```
1 min sample
C-stat :  0.8673851447075491
2 min sample
C-stat :  0.8620831069781316
3 min sample
C-stat :  0.8419135269868661
4 min sample
C-stat :  0.8385048839463566
5 min sample
C-stat :  0.8371147967063985
6 min sample
C-stat :  0.8389096603074169
7 min sample
C-stat :  0.8339564758891764
8 min sample
C-stat :  0.8334292014188476
9 min sample
C-stat :  0.8322201983404169
10 min sample
C-stat :  0.8326249747014774
```

```python
# Optimised model
model = RandomForestRegressor(n_estimators=200, oob_score=True, n_jobs=-1,
                              random_state=42, max_features="log2",
                              min_samples_leaf=1)
model.fit(X, y)
roc_score = roc_auc_score(y, model.oob_prediction_)
print("C-stat : ", roc_score)
```

C-stat : 0.8673851447075491

As you can clearly observe, the prediction score improved from 0.73 to 0.86, which is a fair improvement.

You can also check for the prediction accuracy by implementing other machine learning algorithms and compare them to select the best among them.
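As a hedged sketch of such a comparison (the Titanic features themselves are not reproduced here; a synthetic dataset from `make_classification` stands in for them, and the model settings are illustrative):

```python
# Comparing two classifiers with cross-validated AUC on a synthetic
# stand-in dataset (not the actual Titanic training data).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

for name, clf in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                  ("Random Forest", RandomForestClassifier(n_estimators=100,
                                                           random_state=42))]:
    # 5-fold cross-validated AUC, comparable to the C-stat used above
    scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```

The same loop pattern extends to any other scikit-learn estimator (decision trees, gradient boosting, etc.), so you can rank candidates on one consistent metric.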

The post Data Science Solution Using Titanic Dataset appeared first on StepUp Analytics.


Data visualization, which helps us present our analysis of any data, is primarily performed using Matplotlib, a very powerful and comprehensive library for such tasks. But one point where Matplotlib suffers is the length of the code, which can grow significantly even for small representations.

This is where Seaborn comes to our rescue. Seaborn is used for plotting some of the most pleasing data visualization representations.

Around the globe, Seaborn is known for its ability to make statistical graphs in Python. Matplotlib is the library that acts as the basic building block for Seaborn, along with Pandas.

Salient features which make it so lovable are as follows:

- Seaborn is used for visualizing univariate or bivariate distributions and even for comparing them.
- It simplifies the representations of complex datasets.
- Seaborn provides us with the control over matplotlib’s figure styling through various inbuilt themes which it possesses.
- Through seaborn, we can choose amongst the variety of color palettes for making our plots much more conclusive to the viewer.
- Using seaborn, we have the facility of representation of categorical values.

Overall, the use of data frames and arrays makes Seaborn much more efficient with whole datasets, and it produces plots that convey a lot of information. One thing we must remember is that Seaborn is not a substitute for Matplotlib; rather, it complements Matplotlib and improves on its weak points, thus presenting a holistic view.

In this tutorial, we’ll look at some of the most essential plots which are built using Seaborn library. All the way through this tutorial, we’ll try to learn the concepts using examples and its explanation.

- What is Data Visualization
- Why Data Visualization
- Installing Seaborn
- Importing libraries and dataset
- Different types of plots/graphs which are built using Seaborn
- Bar Plot
- Box Plot
- Scatter Plot
- Violin Plot
- Swarm Plot
- Overlaying Swarm Plot with Violin plot
- Joint Kernel Density Plot
- Joint Plot
- Heatmap
- Clustermap

**Data Visualization **can be defined as a process of extracting essential information from raw/processed data and then representing it pictorially for better understanding and analysis of the facts/figures. Data Visualization is an amalgamation of two fields i.e. Science and Art, this means we are applying our scientific and artistic skills in the making of any kind of visualizations.

To be more precise, data visualization is a strategy of depicting the quantitative knowledge obtained through various data wrangling processes in a graphical manner.

The main objective of data visualization is to decide the most optimal way of presenting a data set with the data visualization practices in mind and all these tasks become much more difficult when we have to deal with large datasets.

In recent times, the amount of data generated is gigantic, and we want it to be of some significance. By some estimates, more data now crosses the internet every second than was stored on the entire internet two decades ago.

This is because people use the internet not only to build connections amongst themselves but also to connect machines, so the data will keep growing exponentially.

The human mind is not competent enough to deal with such data and thus we need some kind of tool which can help us to understand profoundly what all such data is trying to convey. For these purposes, only data visualization is used since it not only reduces the size of large datasets but is able to represent the useful information very easily in the form of graphs which also aids in delivering the content to the viewer.

There are numerous plots which are used in Data Visualization such as Histograms, Pie Chart, Box Plot, Word Cloud, Scatter plot etc. there is a long list of such graphs and most of them we’ll see with examples very soon in this tutorial.

As we will be working with Seaborn, a Python library, we need to install it on the system we are working on. I have used Jupyter Notebook for this tutorial and would recommend you use it as well.

You can go for the following installations if you wish:

- Python 2.7/ Python 3(Preferred)
- Python Libraries for Data Visualization
- Pandas
- Matplotlib
- Seaborn

- Jupyter Notebook.

You can download and install **Anaconda Distribution** as it comes along with the packages which are required. You only have to follow the steps on that page.

If you have python and you want to install seaborn in your system, you can execute the following and then it will help you to install the latest version of Seaborn in your system.

```shell
sudo pip install seaborn
```

If you are opting to work upon Jupyter notebook. Then, once you have installed Anaconda, open Jupyter Notebook (either through the command line or navigator app) and create a new notebook.

Initially, we will import Pandas which will be used for handling relational datasets.

```python
# Pandas for managing datasets
import pandas as pd
```

After this, we import Matplotlib, which will help us make changes to the plots we will create using Seaborn.

```python
# Matplotlib for additional customization
import matplotlib.pyplot as plt
%matplotlib inline
```

**NOTE: **While using Jupyter Notebook, we write %matplotlib inline for displaying the plots in the notebook.

Finally, we will import the library which is the base of this tutorial i.e. Seaborn.

```python
# Seaborn as our primary library for plotting and styling
import seaborn as sns
```

Now we are all set to import the dataset which we will be using for Visualization purposes.

**NOTE: **We have used aliases for the imported libraries such as pd for pandas, plt for Matplotlib and sns for Seaborn. This is actually done for our convenience, so that we can invoke these libraries and its methods using these aliases and we will not be required to write the full name of the library.

The dataset for this tutorial is a very interesting one i.e. Pokémon dataset. You can download it for free from here. **Pokemon.csv**

After downloading the dataset, remember to keep it (the csv file) in the same folder where your Python file/Jupyter notebook is located, so there will be no issues with providing the location of the dataset.

Now to import the dataset we have to execute the following code.

```python
# Read the dataset
df = pd.read_csv("pokemon.csv", index_col=0)
```

**NOTE: **Here the index_col=0 argument tells that we will be using the first column of the dataset as the ID column.

Once we have imported the dataset, we can view its values and see what it looks like with the head() function of the DataFrame (df). We can pass different values to head() to view more or fewer rows; the default is 5.
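For instance, on a small illustrative DataFrame (the rows below are made up to mimic a few Pokémon columns; they are not taken from the actual file):

```python
import pandas as pd

# A tiny illustrative frame; column names mimic the Pokémon dataset.
demo = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur",
             "Charmander", "Charmeleon", "Squirtle"],
    "Type 1": ["Grass", "Grass", "Grass", "Fire", "Fire", "Water"],
    "Attack": [49, 62, 82, 52, 64, 48],
})

print(demo.head())    # first 5 rows by default
print(demo.head(2))   # pass a value to view fewer rows
```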

**Step 5:** Let’s start plotting various graphs and visualizations.

```python
# Count Plot
sns.countplot(x='Type 1', data=df)
```

We use the in-built function of seaborn i.e. countplot() for plotting the bar graph where we have provided the ‘Type 1’ as the value for x-axis and ‘df’ as the value for data.

This is the output we obtain when we execute the above code, where x-axis as Type 1 and y-axis is labelled as count. But you must have noticed that the x-axis values are not visible due to the lack of space. For getting rid of this problem we can try the following.

```python
# Count Plot
sns.countplot(x='Type 1', data=df, palette='rainbow')
plt.xticks(rotation=70)
plt.rcParams['xtick.labelsize'] = 15
plt.rcParams['axes.labelsize'] = 20
```

Here we have made some additions to the previous code so that we get a cleaner output of the bar plot. You can see we have used matplotlib’s ‘xticks’ method with ‘rotation’ set to 70, which tilts the x-axis labels by 70 degrees and makes them clearly visible. Moreover, we have passed another argument to countplot, ‘palette’, with the value ‘rainbow’, which colours the bars in rainbow colours.

Along with this, we have used ‘rcParams’ method of matplotlib to increase the size of x-axis values and labels of x-axis.

We can clearly see the values of x-axis are presented neatly and the labels of both the axes are also visible clearly.

**Uses:
**A Bar Plot is used to represent a comparison between categories of data. It can be represented either in vertical/horizontal manner.

A box plot is the visual representation of the statistical five-number summary of a given data set, i.e. minimum, first quartile, median, third quartile and maximum.

```python
# Boxplot
sns.boxplot(data=df)
plt.xticks(rotation=90)
```

For plotting the boxplot we have used the boxplot() function of seaborn, but we can see that for some columns the result is insignificant. We will therefore remove the redundant columns: ‘Total’, since we already have the individual stats, and the ones which are not combat stats, i.e. ‘Legendary’ and ‘Generation’.

**NOTE: **The black dots which are visible are the outliers in the dataset.

```python
# axis=1 tells drop() to remove columns, not rows
df = df.drop(['Total', 'Generation', 'Legendary'], axis=1)
sns.boxplot(data=df)
```

Now we have removed the ‘Total’, ‘Legendary’ and ‘Generation’ columns from the dataset using the **drop()** method of df, specifying axis=1 to indicate that we are dropping columns (the ‘#’ column was already consumed as the index when reading the CSV).

**Uses:
**It is used in exploratory data analysis. Through this, we can represent the shape of the distribution of data through it.

```python
sns.lmplot(x="Sp. Atk", y="Sp. Def", data=df)
```

Here we have used the lmplot () function of seaborn for creating the scatter plot where we have provided values of x-axis and y-axis as ‘Sp. Atk’ and ‘Sp. Def’ respectively and provided the ‘data’ parameter with value ‘df’

It is evident that we were looking for a scatter plot, but we have also obtained a regression line. This is because ‘lmplot()’ is used to fit and plot a regression line, which is why we see the diagonal line.

```python
sns.lmplot(x="Sp. Atk", y="Sp. Def", data=df, hue='Type 1', fit_reg=False)
```

To solve the regression line issue, we can use ‘fit_reg’ argument and set it as ‘False’. Moreover, we can use the ‘hue’ parameter which will help us to present the points on the graph with much more clarity i.e. we represent the third dimension of information using ‘Type 1’ as the value for ‘hue’.

We can clearly view the different types of Pokémon and their special attack and special defense values through this scatter plot.

**Uses:
**Scatter plot helps us to find potential relationships between values and they can help in detecting outliers in the datasets. They can simplify the large dataset’s representation easily.

```python
# Setting the theme
sns.set_style('darkgrid')
plt.figure(figsize=(10, 4))

# Violin Plot
vio_plot = sns.violinplot(x='Type 1', y='Attack', data=df)
plt.xticks(rotation=30)
plt.rcParams['xtick.labelsize'] = 15
plt.rcParams["axes.labelsize"] = 15
```

The violin plot is built using seaborn’s violinplot() function. Before this, we set the background style of the plot with the ‘set_style’ method, which here has been given the value ‘darkgrid’; other values include ‘dark’, ‘white’, ‘whitegrid’ and ‘ticks’.

Then, we can pass the x-axis and y-axis values and then customizing the plot using matplotlib methods.

**Uses:
**They help in representing a fantastically huge amount of information effectively in the small amount of space.

```python
plt.figure(figsize=(10, 4))
plt.xticks(rotation=30)
sns.swarmplot(x='Type 1', y='Attack', data=df, dodge=False)
plt.rcParams['xtick.labelsize'] = 15
plt.rcParams["axes.labelsize"] = 15
```

Now for plotting swarm plot, we use the ‘swarmplot ()’ function of seaborn with ‘Type 1’ and ‘Attack’ as the values. Initially, we have set the size of our swarm plot using ‘figure ()’ function.

**Uses:
**It gives a better representation of the distribution of values. It can be built on its own but is also a good complement to a box or violin plot.

Overlaying swarm and violin plots of the same values can help us analyse the data much more efficiently.

```python
# Setting the figure with matplotlib
plt.figure(figsize=(10, 6))
plt.xticks(rotation=30)
plt.rcParams["axes.labelsize"] = 15

# Creating the desired plot
sns.violinplot(x='Type 1', y='Attack', data=df,
               inner=None)  # removes the inner bars inside the violins
sns.swarmplot(x='Type 1', y='Attack', data=df)

# Title for the plot
plt.title('Attacks by various types')
```

First, to make this plot, we will make the figure a bit larger by using the figure () function of matplotlib. After this, we have used violinplot () function for plotting the violin plot using the ‘Type 1’ and ‘Attack’ values and data is given ‘df’ as value. We have also used ‘inner’ argument with ‘None’ as the value for removing the inner bars which are inside the violin.

Next, we will plot the swarm plot using swarmplot () function with the same values for x-axis and y-axis. We have also provided the title of the plot using title () function of matplotlib.

Here in this plot, we can see the swarm plot over the violin plot. But the points of the swarm plot are not clearly visible, so we will fix this with the following code:

```python
# Setting the figure with matplotlib
plt.figure(figsize=(10, 6))
plt.xticks(rotation=30)
plt.rcParams["axes.labelsize"] = 15

# Creating the desired plot
sns.violinplot(x='Type 1', y='Attack', data=df,
               inner=None)  # removes the inner bars inside the violins
sns.swarmplot(x='Type 1', y='Attack', data=df,
              color='k',   # make the points black
              alpha=0.6)   # alpha controls the transparency

# Title for the plot
plt.title('Attacks by various types')
```

In the above code, to make the swarm plot points visible we will be using two arguments i.e. ‘color’ and ‘alpha’. We have specified the color value as ‘k’ which represents black and ‘alpha’ is used to increase/decrease the transparency which is why we have given the value as ‘0.6’

Now we can clearly see the points of swarm plot over the violin plots. Therefore, this is our final plot where the swarm plot is over the violin plot.

**Uses:
**It helps in visualizing the distribution of data and the probability density.

In this, we get to know how much the data is scattered from the two columns which are under consideration.

```python
sns.set_style('whitegrid')
plt.figure(figsize=(10, 8))
plt.rcParams['xtick.labelsize'] = 15
plt.rcParams['ytick.labelsize'] = 15
sns.kdeplot(df.Attack, df.Defense)
```

For plotting the joint kernel density plot, we proceed with the styling which is done through seaborn and matplotlib. After that, we will use the kdeplot () function of Seaborn. Here we can see that the arguments to the kdeplot () are passed differently as compared to other plotting functions. Here, we use ‘dataframe (df)’ to call the values for both the axes.

**Uses****:
**It helps in finding the probability density function of any dataset. This can help in smoothing the data around values of PDF.

```python
sns.jointplot(x="Sp. Atk", y='Sp. Def', data=df, kind='scatter', color='g')
```

The joint plot is built using the jointplot() function of seaborn, where we provide the values for the x-axis and y-axis, along with the argument ‘kind’ specifying which plot to draw jointly; here we have given the value ‘scatter’, and we have also specified the ‘color’ value as ‘g’, i.e. green. Other values for ‘kind’ are ‘reg’, ‘resid’, ‘hex’, etc.

**Uses:
**Through joint plot, we get the liberty to use two plots for representation of the same data which helps in a better analysis.

For plotting the heatmap we will use a different dataset, ‘flights’, which is a built-in dataset of the Seaborn library, and we will load it using Seaborn itself.

**Loading the dataset using Seaborn**

```python
flights = sns.load_dataset('flights')
```

Here we have used the load_dataset function to load the ‘flights’ dataset for Visualization.

**Displaying the flight’s dataset.**

```python
flights.head()
```

**Creating a pivot table.**

```python
flights = flights.pivot('month', 'year', 'passengers')
```

To plot the heatmap, we first reshape the data with the pivot() function, which makes ‘month’ the rows, ‘year’ the columns, and ‘passengers’ the cell values, giving the month-by-year matrix the heatmap needs.
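To see what pivot() does on its own, here is a tiny stand-alone example (the numbers are made up and are not taken from the flights data):

```python
import pandas as pd

# Long-format records: one row per (month, year) observation.
long_df = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "year": [1949, 1950, 1949, 1950],
    "passengers": [112, 115, 118, 126],
})

# pivot() reshapes to a matrix: months as rows, years as columns,
# passenger counts as cell values -- the shape a heatmap expects.
wide = long_df.pivot(index="month", columns="year", values="passengers")
print(wide)
```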

Plotting the Heatmap

```python
sns.heatmap(flights, cmap="OrRd")
```

Now to plot the Heatmap, we use the heatmap () function of Seaborn where we have passed the dataset flights as one argument and color of the Heatmap as ‘OrRd’ i.e. Orange and Red.

**NOTE: **There are many cmap values which can be passed for trying out.

In the heatmap we can see that as the values get higher, the colour intensity increases, while lower values appear in lighter shades.

```python
sns.heatmap(flights, cmap="coolwarm", annot=True, fmt='d', linewidths=2)
plt.xticks(rotation=70)
```

To make the heatmap more informative we can add some more arguments to the heatmap() function. The values displayed in each cell are controlled by the ‘annot’ argument, which is set to ‘True’, and fmt='d' formats those values as integers. Lastly, to make it neater, we draw borders between the cells using the ‘linewidths’ argument; we can specify different values for it as per our requirements.

**Uses:
**Since it is the 2-dimensional figure, it helps in visualizing complex data. We can even use it to represent large datasets.

```python
plt.figure(figsize=(6, 4))
sns.clustermap(flights, cmap='RdYlBu', linewidths=0.5, figsize=(8, 8),
               annot=True, fmt='d')
```

So we’ll end this tutorial with another graph, the clustermap, which is built using the clustermap() function of seaborn. A clustermap is very similar to a heatmap; the only difference is that similar values are grouped together into clusters, as its name suggests.

Here again, we have used matplotlib for designing the figure. The other arguments are similar to the previous heatmap example.

Here you can see the months are not in their usual order; this is because the clustermap has grouped months with similar values into clusters. The years are clustered on the same criterion.

```python
sns.clustermap(flights, col_cluster=False, cmap='cool', linewidths=2,
               standard_scale=0)
```

We can even control the clustering in a clustermap using the ‘col_cluster’/‘row_cluster’ arguments, which take True/False. In the example above, we set col_cluster to ‘False’, so the columns are not clustered. Secondly, we used standard_scale to standardize our dataset by rows or columns: with the value ‘0’ the data across rows is standardized, and with ‘1’ the column data is standardized.

Here at the top-left corner, we can see the range has been changed (200 – 600 in the previous example) and has been made from 0.0 to 1.0, so this how the values are standardized.

**Uses:
**With the ability to standardize data, we can represent datasets through a cluster map which have columns with huge differences making it easier for analysis.

So finally we have reached the end of this Introduction to Data Visualization with Seaborn, I hope you would have liked it and most importantly learned from it.

The post Introduction To Python For Data Visualization With Seaborn appeared first on StepUp Analytics.

The post Support Vector Machine appeared first on StepUp Analytics.

- Introduction
- How does Support Vector Machine (SVM) work?
- Kernel trick
- Implementing SVM in Python
- Advantages and Disadvantages
- Applications

Support Vector Machine (SVM) is a popular supervised machine learning algorithm used for both classification and regression, though it is mostly used for classification tasks. An SVM model is a representation of data points in space such that the points can be separated into categories by a clear gap that is as wide as possible. It looks at the extreme points, a.k.a. support vectors (marked in the following figure), of the dataset and draws a boundary known as a hyperplane.

When data is unlabelled, supervised learning is not possible. As a result, an unsupervised learning approach is used which attempts to find the natural clustering of data to form groups. The support vector clustering applies the statistics of support vectors to categorize the unlabelled data. This is one of the most widely used algorithms in industrial applications.

Let’s say you have some sample points in 2D space. Now you want to classify the stars and the circles with a hyperplane.

Basically, you can classify them perfectly using any of these 3 planes namely 1, 2, 3. But is there any systematic way to choose the right plane among them? The answer is YES!

The thumb rule is you need to identify the hyperplane that has the maximum distance between the nearest data points of either class. This distance is called Margin. Even if you draw any other plane parallel to hyper-plane 2 on either of its sides, its margin would be less as compared to that of hyper-plane 2. So the hyper-plane 2 is the correct choice.
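For a linear SVM the margin can actually be computed from the fitted weights, since its width is 2/||w||. A minimal sketch with made-up 2D points (the coordinates and the large C value are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters of points (made-up coordinates).
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM.
model = SVC(kernel="linear", C=1e6).fit(X, y)

w = model.coef_[0]
margin = 2 / np.linalg.norm(w)   # distance between the two margin lines
print("margin width:", margin)
print("support vectors:\n", model.support_vectors_)
```

Only the points sitting on the margin lines end up in `support_vectors_`; moving any other point (without crossing the margin) leaves the hyperplane unchanged.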

**Note: **

- SVM selects a hyperplane in such a way that it classifies the objects accurately prior to maximizing the margin. Hyper-plane 2 classifies all objects accurately, whereas hyper-plane 1 has a classification error; hence hyper-plane 2 is the right plane.
- SVM is robust to outliers. It ignores the outliers and finds a plane with maximum margin.

What we have seen so far is the linear support vector machine: the clusters could be separated linearly. But what if there exists a non-linear data set that you cannot separate into different clusters using a hyperplane? Suppose you have a dataset like this. It looks impossible to separate it into two clusters using a hyper-plane, keeping the computational cost in mind.

Here you can use a kernel function to map the data points into a higher-dimensional space. For instance, applying a polynomial function turns the data into a parabola, where the points can easily be separated by a single hyperplane, as shown in the following figure.

Hence you can convert the 1D data points to 2D data points and also 2D data points to 3D data points. But the computational cost is high.
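A minimal sketch of this 1D-to-2D idea (the data points are made up): mapping each 1D point x to (x, x²) turns a threshold on the parabola into an ordinary linear boundary.

```python
import numpy as np
from sklearn.svm import SVC

# 1D points: the inner points (|x| small) are class 0, the outer ones
# class 1 -- no single threshold on x can separate them.
x = np.array([-4, -3, -2, -1, 1, 2, 3, 4], dtype=float)
y = np.array([1, 1, 0, 0, 0, 0, 1, 1])

# Map each point x -> (x, x^2): class 1 now sits higher on the parabola,
# so a straight line (a hyperplane in 2D) separates the classes.
X2 = np.column_stack([x, x ** 2])

clf = SVC(kernel="linear").fit(X2, y)
print("accuracy after the mapping:", clf.score(X2, y))
```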

The kernel trick is like a magic wand: it boils complex, non-separable data points down to a simpler form while minimizing the computational cost. It takes input vectors in the original space and returns the dot product of the vectors in the feature space.

In effect, every point is mapped into a higher-dimensional space by some transformation, yet the dot product is computed in the original space. It is a technique in machine learning that avoids intensive computation, turning some computations from infeasible to feasible.
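As a concrete illustration (a standard identity, not taken from the original post): for 2D inputs, the degree-2 polynomial kernel K(a, b) = (a·b)² returns exactly the dot product of the explicit feature maps φ(a) and φ(b), without ever constructing those feature vectors.

```python
import numpy as np

def phi(v):
    # Explicit feature map for the degree-2 polynomial kernel in 2D:
    # phi(v) = (v1^2, sqrt(2)*v1*v2, v2^2)
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

def poly_kernel(a, b):
    # The kernel computes the same value directly in the original space.
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

print(np.dot(phi(a), phi(b)))  # dot product in the feature space
print(poly_kernel(a, b))       # same value via the kernel trick
```

The saving grows with dimension: the kernel needs one dot product, while the explicit map would need a feature vector that grows quadratically (or worse, for higher degrees).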

In the following input space, the red and the blue data points have been separated using a complex, computationally expensive boundary. To minimize this cost, the data has been transformed into a higher-dimensional feature space (2D to 3D), where the points can easily be separated into different clusters using a hyperplane.

Some popular kernel functions are:

- Polynomial kernel
- Gaussian Radial basis function (RBF) kernel
- Gaussian kernel
- Laplace RBF kernel
- Sigmoid kernel etc.

No matter which kernel you use, it is important to tune its parameters.

The Iris dataset can be downloaded from https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data, though below we load it directly from scikit-learn.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

```python
# Import the dataset
from sklearn.datasets import load_iris
iris = load_iris()
```

```python
print(iris.target)
print(iris.target_names)
```

```
### Output ###
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
['setosa' 'versicolor' 'virginica']
```

Storing them in different objects

```python
X = iris.data
y = iris.target
```

Partition into test and train data

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)
```

```python
from sklearn.svm import SVC
model = SVC()
model.fit(X_train, y_train)
```

```
### Output ###
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
```

```python
# Predict
result = model.predict(X_test)
print(result)
```

```
### Output ###
[2 2 1 1 2 1 0 0 0 2 1 0 2 2 0 2 2 1 2 0 1 0 0 2 1 2 0 0 2 1 0 1 2 2 0 2 2
 2 1 1 1 1 1 1 0 2 1 0 1 2 1 1 2 0 2 0 0 0 0 0]
```

```python
# Classification report and confusion matrix
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, result), '\n',
      classification_report(y_test, result))
```

```
### Output ###
[[20  0  0]
 [ 0 18  2]
 [ 0  1 19]]
              precision    recall  f1-score   support
           0       1.00      1.00      1.00        20
           1       0.95      0.90      0.92        20
           2       0.90      0.95      0.93        20
 avg / total       0.95      0.95      0.95        60
```

Finding the best parameters value using a grid search

```python
# GridSearchCV now lives in sklearn.model_selection
# (older tutorials import it from the deprecated sklearn.grid_search)
from sklearn.model_selection import GridSearchCV

# Finding the best combination of the C and gamma parameters
parameter_grid = {'C': [0.1, 1, 10, 100, 1000],
                  'gamma': [1, 0.1, 0.01, 0.001, 0.0001]}
grid = GridSearchCV(SVC(), parameter_grid, verbose=3)
grid.fit(X_train, y_train)
```

**Output**

```
Fitting 3 folds for each of 25 candidates, totalling 75 fits
[CV] C=0.1, gamma=1 ..................................................
[CV] ......................... C=0.1, gamma=1, score=1.000000 -   0.0s
[CV] C=0.1, gamma=1 ..................................................
...
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:    0.2s finished
GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
       decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
       max_iter=-1, probability=False, random_state=None, shrinking=True,
       tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'C': [0.1, 1, 10, 100, 1000],
                   'gamma': [1, 0.1, 0.01, 0.001, 0.0001]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=3)
```

The C parameter controls the cost of misclassification: a large C penalizes misclassified points heavily, yielding low bias but high variance, while a small C allows a wider margin, trading some bias for lower variance.
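This trade-off can be observed directly: with a small C the optimizer tolerates more margin violations, so more training points end up as support vectors. A minimal, self-contained sketch (not from the original post; the `random_state` value and the C grid are arbitrary choices for illustration, and `gamma=1` mirrors the grid search's best gamma):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=0)

n_sv = {}
for C in (0.01, 1, 100):
    model = SVC(C=C, gamma=1).fit(X_train, y_train)
    n_sv[C] = model.n_support_.sum()  # total number of support vectors
    print('C=%g: %d support vectors, test accuracy %.3f'
          % (C, n_sv[C], model.score(X_test, y_test)))
```

As C grows, the margin tightens and the support-vector count drops, which is exactly the low-bias/high-variance direction described above.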

```python
grid.best_params_
```

**Output**

```
{'C': 1, 'gamma': 1}
```

```python
grid.best_estimator_
```

**Output**

```
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
```

```python
grid_predictions = grid.predict(X_test)
print(confusion_matrix(y_test, grid_predictions))
print('\n')
print(classification_report(y_test, grid_predictions))
```

**Output**

```
[[20  0  0]
 [ 0 18  2]
 [ 0  2 18]]


              precision    recall  f1-score   support

          0       1.00      1.00      1.00        20
          1       0.90      0.90      0.90        20
          2       0.90      0.90      0.90        20

avg / total       0.93      0.93      0.93        60
```

**Advantages:**

- It is a robust model for prediction problems.
- It works effectively even when the number of features is greater than the number of samples.
- Non-linear data can also be classified, using the kernel trick to build customized hyperplanes in a higher-dimensional feature space.
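
To illustrate the kernel-trick point above, the same `SVC` class can fit linear and non-linear decision boundaries simply by changing its `kernel` argument. A sketch on the iris data used in this post (not part of the original; scores use scikit-learn defaults with 5-fold cross-validation):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

scores = {}
for kernel in ('linear', 'poly', 'rbf'):
    # Mean 5-fold cross-validated accuracy for each kernel
    scores[kernel] = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
    print(kernel, round(scores[kernel], 3))
```

All three kernels separate iris well; on harder, non-linearly-separable data the gap between `linear` and `rbf` is usually much larger.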

**Disadvantages:**

- Choosing the right kernel can be difficult.
- SVM can be slow in the test (prediction) phase.
- High algorithmic complexity and large memory requirements, since training solves a quadratic programming problem.
- When the number of samples is very large, training becomes slow and results can degrade.

**Applications:**

- **Face detection:** SVM can classify an image as face or non-face.
- **Bioinformatics:** protein classification and cancer classification.
- **Text and hypertext classification:** classifying natural-text or hypertext documents based on their content, e.g., email filtering.
- **Handwriting detection:** SVM is used to identify handwritten characters for data entry and to validate signatures on documents.
- **Image classification.**
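To make the text-classification use case concrete, here is a toy sketch (the documents and labels are invented for illustration) that combines TF-IDF features with a linear SVM in a scikit-learn pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy corpus: two spam and two ham documents
docs = ["win a free prize now", "cheap meds free offer",
        "meeting agenda attached", "project status report"]
labels = ["spam", "spam", "ham", "ham"]

# TF-IDF turns text into numeric features; LinearSVC classifies them
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(docs, labels)
print(clf.predict(["free prize offer"]))
```

Real email filtering would use thousands of labeled messages, but the pipeline structure is the same.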

For further studies, latest updates or interview tips on data science and machine learning, subscribe to our emails.

The post Support Vector Machine appeared first on StepUp Analytics.
