The post Steps Of K-Means Clustering In R appeared first on StepUp Analytics.

Clustering can be used to improve predictive accuracy by segmenting databases into more homogeneous groups. Then the data of each group can be explored, analyzed, and modeled.

Clustering is used to classify items or cases into relatively homogeneous groups called clusters. Objects in one cluster tend to be similar to each other and dissimilar to objects in the other clusters.

K-Means Clustering groups items or observations into a collection of K clusters. The number of clusters, K, may either be specified in advance or determined as part of the clustering procedure. K-Means clustering is included in the Machine Learning section of CS2 (Risk Modelling and Survival Analysis). Let’s have a look at the procedure and how it’s applied in R.

**1.** Partition the items into K initial clusters, where K is an initial estimate of the number of clusters. K can be set according to business requirements, or determined using the elbow method (a widely used technique), which is discussed later in this article.

**2.** Compute the Euclidean distance (with either standardized or unstandardized observations) and assign each item to the cluster whose centroid (mean) is nearest. Recalculate the centroids of the cluster receiving the new item and the cluster losing it.

**3.** Repeat Step 2 until no more reassignments take place.

Let’s have a look at K-Means Clustering on the Wholesale Customer dataset (ref. UCI Machine Learning Repository) using R.

**Data description:**

- FRESH: annual spending on fresh products (Continuous);
- MILK: annual spending on milk products (Continuous);
- GROCERY: annual spending on grocery products (Continuous);
- FROZEN: annual spending on frozen products (Continuous);
- DETERGENTS_PAPER: annual spending on detergents and paper products (Continuous);
- DELICATESSEN: annual spending on delicatessen products (Continuous).

*The kmeans( ) function comes from the ‘stats’ package, which is part of base R and loaded by default, so no installation is needed. You can attach it explicitly with:*

library(stats)

**Step 1:** Read the data using the import-dataset dialog or read.csv( ), and assign it to data1.

**Step 2:** Get the descriptive statistics of the data using summary( ) in R.

**Step 3:** Here, we observe that the data has a large range of values for some variables as compared to others. The variables with a larger range of values tend to dominate, so we standardize all the variables so that each uses the same range. We rescale the variables so that they have a mean of 0 and a standard deviation of 1.

A large z-score implies that the observation is far away from the mean in terms of standard deviations; e.g., a z-score of 3 means that the observation is 3 standard deviations away from the mean.

We rescale the data using scale( ) in R.

data1 <- scale(data1)

**Step 4:** Now we need to find the optimal number of clusters, K. The elbow method analyses how the homogeneity or heterogeneity within the clusters changes for various values of K. Homogeneity within clusters usually increases as additional clusters are added, and heterogeneity decreases. The goal is to find the value of K beyond which there is negligible gain in information. If one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters add much information (explain a lot of variance), but at some point the marginal gain drops, which shows up as an elbow in the curve; hence the name Elbow method. We do this in R using a function which gives the within-cluster sum of squares for different numbers of clusters.

withss <- sapply(1:10, function(k) {
  kmeans(data1, k, nstart = 50, iter.max = 15)$tot.withinss
})

Plot the within sum of squares against the number of clusters using plot( ) in R:

plot(1:10, withss, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters", ylab = "Within Sum of squares")
axis(1, at = 1:10, labels = seq(1, 10, 1))

In figure 1, we observe elbows at 2 and 5 clusters. Analyzing values of K from 2 to 5, we find that 3 clusters give the most useful customer segmentation. The 5-cluster solution gives a more detailed segmentation, but at this stage we’ll look at the 3-cluster solution. Thus, carry out the cluster analysis for 3 clusters using kmeans( ) in R:

clust_output <- kmeans(data1, centers = 3)

**Step 5:** Analyze the cluster analysis output.

There are 3 clusters of sizes 49, 347 and 44 respectively. The cluster centers give us insights about the cluster description.

**Cluster 1** has the highest spenders on Fresh, Frozen, and Delicatessen products. This cluster consists of consumers who spend more on fine foods and are high spenders.

**Cluster 2** has low spenders across all products.

**Cluster 3** has the highest spenders on Milk, Grocery, and Detergents/Paper. This cluster consists of consumers who spend mainly on domestic and household products.


The post What Is Classification appeared first on StepUp Analytics.

Classification is the process of predicting the class of given data points. Classes are sometimes called targets, labels, or categories. Classification predictive modeling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y).

Unlike regression where you predict a continuous number, you use classification to predict a category. There is a wide variety of classification applications from medicine to marketing.

Let’s say you own a shop and you want to figure out if one of your customers is going to come visit your shop again or not. The answer to that question can only be a ‘Yes’ or ‘No’.

These kinds of problems in Machine Learning are known as Classification problems.

Classification problems normally have a categorical output like ‘Yes’ or ‘No’, ‘1’ or ‘0’, ‘True’ or ‘False’. Let’s go through another example:

Say you want to check if on a particular day, a game of cricket is possible or not.

In this case the weather conditions are the determining factors, and based on them, the outcome can either be ‘Play’ or ‘Don’t Play’.

Just like Classification, there are two other types of problems in Machine Learning:

**Regression and Clustering**

In the image above, we have the list for all the different algorithms or solutions used for each of the problems.

There are 5 commonly used types of algorithms for solving classification problems:

- Decision Tree
- Naive Bayes
- Random Forest
- Logistic Regression
- KNN

Based on the kind of problem statement and the data in hand, we decide the kind of classification algorithm to be used.

A decision tree builds classification or regression models in the form of a tree structure. It breaks a data set down into smaller and smaller subsets while an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches, and a leaf node represents a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.
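As a minimal illustration, here is a decision tree fitted with scikit-learn on a tiny made-up dataset (the data and variable names are illustrative, not from the post):

```python
from sklearn.tree import DecisionTreeClassifier

# Tiny illustrative dataset: the class label depends only on the first feature.
X = [[0, 0], [1, 1], [1, 0], [0, 1]]
y = [0, 1, 1, 0]

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

# The root node splits on the best predictor (the first feature here),
# and each leaf node carries a class decision.
print(tree.predict([[1, 1]])[0])
```

Because the label is fully determined by the first feature, the fitted tree separates the training data perfectly with a single split.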

Naive Bayes is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or on the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple, and that is why it is known as ‘Naive’.

Naive Bayes model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.

Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c):

P(c|x) = P(x|c) · P(c) / P(x)

where:

- *P*(*c|x*) is the posterior probability of the *class* (c, target) given the *predictor* (x, attributes);
- *P*(*c*) is the prior probability of the *class*;
- *P*(*x|c*) is the likelihood, i.e. the probability of the *predictor* given the *class*;
- *P*(*x*) is the prior probability of the *predictor*.
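A tiny worked numeric example of Bayes’ theorem (the probability values are purely illustrative):

```python
# Illustrative probabilities for one class c and one observed predictor x.
p_c = 0.3          # P(c): prior probability of the class
p_x_given_c = 0.8  # P(x|c): likelihood of the predictor given the class
p_x = 0.5          # P(x): prior probability of the predictor

# Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)
p_c_given_x = p_x_given_c * p_c / p_x
print(p_c_given_x)  # approximately 0.48
```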

Random Forest is a supervised learning algorithm. As the name suggests, it creates a forest and makes it somewhat random. The “forest” it builds is an ensemble of Decision Trees, most of the time trained with the “bagging” method. The general idea of bagging is that a combination of learning models increases the overall result.

**To say it in simple words: Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.**

One big advantage of random forest is that it can be used for both classification and regression problems, which form the majority of current machine learning systems. I will talk about random forest in classification, since classification is sometimes considered the building block of machine learning. Below you can see what a random forest with two trees would look like:

Imagine a guy named Andrew who wants to decide which places he should travel to during a one-year vacation trip. He asks people who know him for advice. First, he goes to a friend, who asks Andrew where he traveled to in the past and whether he liked it or not. Based on the answers, the friend gives Andrew some advice.

This is a typical decision tree algorithm approach. Andrew’s friend created rules to guide his recommendation by using Andrew’s answers.

Afterward, Andrew asks more and more of his friends to advise him, and they again ask him different questions from which they derive recommendations. Finally, he chooses the places that were recommended to him most often, which is the typical Random Forest algorithm approach.
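The same idea in code: a hedged scikit-learn sketch in which an ensemble of bagged decision trees votes on the prediction (the dataset is illustrative, not from the post):

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative, linearly separable data: the label equals the first feature.
X = [[0, 0]] * 5 + [[0, 1]] * 5 + [[1, 0]] * 5 + [[1, 1]] * 5
y = [0] * 10 + [1] * 10

# An ensemble of decision trees trained with bagging; the trees then vote,
# merging many individual "friends' opinions" into one stable prediction.
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X, y)
print(forest.predict([[1, 0]])[0])
```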

The k-nearest-neighbours algorithm is a supervised classification algorithm: it takes a bunch of labelled points and uses them to learn how to label other points. To label a new point, it looks at the labelled points closest to that new point (its nearest neighbours) and has those neighbours vote: whichever label most of the neighbours have becomes the label for the new point (the “k” is the number of neighbours it checks).

*k*-Nearest Neighbour is a lazy learning algorithm which stores all instances corresponding to training data points in n-dimensional space. When unknown discrete data is received, it analyzes the closest k saved instances (nearest neighbors) and returns the most common class as the prediction; for real-valued data, it returns the mean of the k nearest neighbors.

The distance-weighted nearest neighbor algorithm weights the contribution of each of the k neighbors according to its distance from the query point, giving greater weight to closer neighbors.

Usually, KNN is robust to noisy data since it averages over the k nearest neighbors.
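The voting idea can be sketched in a few lines of Python (the points, labels, and the function name `knn_predict` are this sketch’s own, not from the post):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of ((x, y), label) pairs; returns the majority label
    among the k labelled points nearest to `query`."""
    sq_dist = lambda item: (item[0][0] - query[0]) ** 2 + (item[0][1] - query[1]) ** 2
    nearest = sorted(train, key=sq_dist)[:k]          # the k nearest neighbours
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"), ((5, 5), "b"), ((6, 5), "b")]
print(knn_predict(train, (0.5, 0.5)))  # the three nearest neighbours are all "a"
```

Note the "lazy" character: there is no training step at all, just a lookup at prediction time.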

- What is Logistic Regression?
- How it works
- Logistic vs. Linear Regression
- Advantages / Disadvantages
- When to use it
- Implementation in Python

**Logistic regression** is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes). It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables. To represent the binary/categorical outcome, we use dummy variables.

You can also think of logistic regression as a special case of linear regression when the outcome variable is categorical, where we are using log of odds as the dependent variable. In simple words, it predicts the probability of occurrence of an event by fitting data to a **logit** function.

**Logistic regression** was developed by statistician **David Cox** in 1958. This binary logistic model is used to estimate the probability of a binary response based on one or more predictor (or independent) variables (features). It allows one to say that the presence of a risk factor increases the probability of a given outcome by a specific percentage.

Like all regression analyses, **logistic regression** is a predictive analysis. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.

**Application of Logistic Regression:** It is used in healthcare, the social sciences, and various machine learning applications for advanced research and analytics.

Logistic Regression measures the relationship between the dependent variable (our label, what we want to predict) and one or more independent variables (our features) by estimating probabilities using its underlying logistic function.

These probabilities must then be transformed into binary values in order to actually make a prediction. This is the task of the logistic function, also called the sigmoid function. The sigmoid function is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits. These values between 0 and 1 are then transformed into either 0 or 1 using a threshold classifier.
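The sigmoid and the threshold step can be written in a few lines of Python (the function names are this sketch’s own):

```python
import math

def sigmoid(z):
    """S-shaped logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def threshold_classify(z, threshold=0.5):
    """Turn the probability sigmoid(z) into a hard 0/1 prediction."""
    return 1 if sigmoid(z) >= threshold else 0

print(sigmoid(0))  # 0.5, the midpoint of the curve
```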

The picture below illustrates the steps that logistic regression goes through to give you your desired output.

Below you can see what the logistic function (sigmoid function) looks like:

We want to maximize the likelihood that a random data point gets classified correctly, which is called Maximum Likelihood Estimation. Maximum Likelihood Estimation is a general approach to estimating parameters in statistical models. You can maximize the likelihood using different methods like an optimization algorithm.

Newton’s Method is such an algorithm and can be used to find the maximum (or minimum) of many different functions, including the likelihood function. Instead of Newton’s Method, you could also use Gradient Descent.
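As a concrete sketch of maximum likelihood estimation, here is plain gradient ascent on the log-likelihood of a one-feature logistic model (the data points, learning rate, and iteration count are illustrative assumptions, not from the post):

```python
import math

# Illustrative, linearly separable data for a one-feature logistic model.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

w, b, lr = 0.0, 0.0, 0.5
for _ in range(500):
    grad_w = grad_b = 0.0
    for x, yv in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted P(y=1|x)
        grad_w += (yv - p) * x  # d(log-likelihood)/dw
        grad_b += (yv - p)      # d(log-likelihood)/db
    w += lr * grad_w            # step uphill on the likelihood
    b += lr * grad_b
```

After the loop, the fitted weight is positive and the model assigns high probability to the correct class on both ends of the data; Newton’s Method would reach a similar optimum in fewer, more expensive steps.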

You may be asking yourself what the difference between logistic and linear regression is. Logistic regression gives you a discrete outcome, but linear regression gives a continuous outcome. A good example of a continuous outcome would be a model that predicts the value of a house. That value will always be different based on parameters like its size or location. A discrete outcome will always be one thing (you have cancer) or another (you have no cancer).

It is a widely used technique because it is very efficient, does not require too many computational resources, it’s highly interpretable, it doesn’t require input features to be scaled, it doesn’t require any tuning, it’s easy to regularize, and it outputs well-calibrated predicted probabilities.

Like linear regression, logistic regression does work better when you remove attributes that are unrelated to the output variable as well as attributes that are very similar (correlated) to each other. Therefore Feature Engineering plays an important role in regards to the performance of Logistic and also Linear Regression. Another advantage of Logistic Regression is that it is incredibly easy to implement and very efficient to train. I typically start with a Logistic Regression model as a benchmark and try using more complex algorithms from there on.

Because of its simplicity and the fact that it can be implemented relatively easy and quick, Logistic Regression is also a good baseline that you can use to measure the performance of other more complex Algorithms.

A disadvantage of it is that we can’t solve non-linear problems with logistic regression since its decision surface is linear. Just take a look at the example below that has 2 binary features from 2 examples.

It is clearly visible that we can’t draw a line that separates these 2 classes without a huge error. To use a simple decision tree would be a much better choice.

Logistic Regression is also not one of the most powerful algorithms out there and can be easily outperformed by more complex ones. Another disadvantage is its high reliance on a proper presentation of your data. This means that logistic regression is not a useful tool unless you have already identified all the important independent variables. Since its outcome is discrete, Logistic Regression can only predict a categorical outcome. It is also an Algorithm that is known for its vulnerability to overfitting.

As I already mentioned, Logistic Regression separates your input into two “regions” by a linear boundary, one for each class. Therefore it is required that your data is linearly separable, like the data points in the image below:

In other words: You should think about using logistic regression when your Y variable takes on only two values (e.g when you are facing a classification problem). Note that you could also use Logistic Regression for multiclass classification, which will be discussed in the next section.

Importing the Essential libraries

Importing the Dataset

Splitting the dataset into the Training set and Test set

Feature Scaling

Fitting logistic regression to the training set

Predicting the Test Set Result

Making the Confusion Matrix

Output of Confusion Matrix

Tip: from the confusion matrix, 65 + 24 = 89 predictions are correct and 8 + 3 = 11 are incorrect.

Visualizing the Training set Results

Output of the Training Set

Visualizing the Test Set

Output of the Test Set
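The steps listed above can be sketched with scikit-learn; the post’s actual dataset is not shown, so a synthetic one (`make_classification`) stands in for it and every number here is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Importing the dataset (synthetic stand-in for the post's data)
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# Splitting the dataset into the training set and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Fitting logistic regression to the training set
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

# Predicting the test set results and making the confusion matrix
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
correct = cm[0, 0] + cm[1, 1]    # diagonal entries are correct predictions
incorrect = cm[0, 1] + cm[1, 0]  # off-diagonal entries are incorrect
```

As in the tip above, the diagonal of the confusion matrix counts the correct predictions and the off-diagonal cells count the errors.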


The post Cyber Security Using Machine Learning: SNORT appeared first on StepUp Analytics.

Computer security or IT security is the protection of computer systems from theft or damage to their hardware, software or electronic data, as well as from disruption or misdirection of the services they provide. The field is of growing importance due to increasing reliance on computer systems, the Internet and wireless networks such as Bluetooth and Wi-Fi, and due to the growth of “smart” devices, including smartphones, televisions and the various tiny devices that constitute the Internet of Things. Due to its complexity, both in terms of politics and technology, it is also one of the major challenges of the contemporary world. (Source: Wikipedia)

The potential threats of the huge cyberspace are not hidden from anyone, and protecting our cyberspace is still a hot topic of research. In computers and computer networks, an attack is an attempt to expose, alter, disable, destroy, steal, gain unauthorized access to, or make unauthorized use of an asset. Two terms come up very frequently when talking about cybersecurity: Intrusion Detection Systems (IDS) and Intrusion Prevention Systems (IPS).

IDS refers to detecting an attack that has happened; IPS refers to preventing such an attack. It is easier to detect an attack than to completely prevent one. Machine learning can be used to increase the reliability of cybersecurity methods. In particular, we will talk about how machine learning can be used in Intrusion Detection Systems.

IDS can be classified into two main categories based on operational logic:

- **Signature-based IDS**
- **Anomaly-based IDS**

**Signature-based** IDS works with definitions of known vulnerabilities that are considered attacks. Its operational logic is a basic classification problem: incoming events are compared with the signatures; if a match is found, an alert is raised; otherwise, no malicious event is reported. It has low flexibility and uses low-level machine learning structures. This system has very high accuracy for known attacks but fails against new attacks (zero-day attacks).

**Anomaly-based** IDS checks the behavior of the traffic and whenever there is an anomaly in the usual behavior, an alarm is raised. It has high flexibility and it uses high-level machine learning structures.
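The contrast between the two styles can be sketched in a toy way; the signature strings, baseline, and threshold below are purely illustrative, not real Snort rules:

```python
# Toy signature set: known attack patterns (illustrative examples only).
SIGNATURES = ("' OR 1=1 --", "../../etc/passwd")

def signature_alert(event):
    """Signature-based: alert only when a known pattern matches."""
    return any(sig in event for sig in SIGNATURES)

def anomaly_alert(packets_per_sec, baseline=100.0, threshold=3.0):
    """Anomaly-based: alert when traffic deviates too far from normal."""
    return abs(packets_per_sec - baseline) / baseline > threshold
```

A zero-day payload absent from `SIGNATURES` slips past `signature_alert`, while the traffic spike it causes may still trip `anomaly_alert`; the false-alarm trade-off lives in the threshold.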

A lot of research has been going on in this area using both supervised and unsupervised algorithms. For academic purposes, many datasets are available on the web for public use; the most popular is KDD99, a well-known benchmark in the research of Intrusion Detection techniques. Much work goes into improving intrusion detection strategies, while research on the data used for training and testing the detection model is of equal concern, because better data quality can improve offline intrusion detection.

The supervised approach usually deals with known attacks. It follows an algorithm that runs on well-defined attacks, that is, signature-based IDS. The dataset contains various definitions of malicious activities, and the system works with labeled events that occurred in the network; each intrusion defined in the dataset is characterized by network flow data. When the artificial neural network encounters an intrusion, it looks for the definition of that intrusion in the dataset. If a definition is found, an alarm is raised; if not, the intrusion is ignored.

This approach has very high accuracy in recognizing well-known malicious activities, and false alarm rates are very low. Bayesian networks along with Support Vector Machines (SVM) are used to detect attacks in a supervised approach. Many artificially intelligent antivirus systems are used in applications that require high security and very low false alarm rates, such as computers containing military information or computers operating missiles. Many institutions in the USA have turned to the supervised approach for the security of documents and critical information.

However, this approach fails in the case of zero-day attacks.

The unsupervised approach to detecting cyber attacks is used when the dataset doesn’t contain any definitions: the class of the attack, its features, and everything else about it are unknown. This approach assumes that a large change in the network flow happens only when a malicious agent has entered the system. The behavior of the network is monitored continuously; a threshold is set, and whenever the anomaly crosses this threshold, an alarm is raised.

In this approach the neural network works on the network data rather than on class definitions, so it is very efficient at detecting zero-day attacks. However, if the attacker crafts the data intelligently, the detection can be bypassed. Moreover, it creates a lot of false alarms. This is a major issue, and research is ongoing to improve the algorithm.

Both techniques have advantages and disadvantages. To combine the advantages efficiently and eliminate the disadvantages, hybrid approaches have been developed: one part of the detection mechanism works with a supervised algorithm, and another part works with an unsupervised algorithm. In recent years, most research has focused on hybrid detection approaches.

Snort is a free and open-source network intrusion prevention system (NIPS) and network intrusion detection system (NIDS) used all around the world. Snort’s open-source network-based intrusion detection system has the ability to perform real-time traffic analysis and packet logging on Internet Protocol (IP) networks. Snort performs protocol analysis, content searching, and matching. These basic services serve many purposes, including application-aware triggered quality of service to de-prioritize bulk traffic when latency-sensitive applications are in use.

Snort can be configured in three main modes: sniffer, packet logger, and network intrusion detection. In sniffer mode, the program will read network packets and display them on the console. In packet logger mode, the program will record packets to the disk. In the intrusion detection mode, the program will monitor network traffic and analyze it against a rule set defined by the user. The program will then perform a specific action based on what has been identified (Source Wikipedia).

Cyber attack detection is like a game between the attacker and the detection system, and there is no ultimate winner. Whenever an attack is detected, the attacker comes up with a more efficient hacking algorithm that can bypass the detection. And whenever an attack bypasses the detection, new and more efficient detection algorithms are developed. It is a never-ending cycle. Machine learning has improved the detection algorithms to a great extent; however, intelligent hackers are developing attacks that can bypass these by exploiting loopholes. Intense research is going on to remove these loopholes and come up with better algorithms.


The post Beginner’s Guide to Reinforcement Learning appeared first on StepUp Analytics.

- What is Reinforcement learning in simple words?
- The components of Reinforcement Learning problem.
- Distinguishing between Reinforcement learning, Supervised and Unsupervised learning.
- Algorithms used for implementing RL.
- Practical implementation of Reinforcement learning.
- Ways used for learning.
- The disadvantage of Reinforcement Learning.
- Applications of Reinforcement Learning around us.
- Real world implementation of Reinforcement Learning.

Reinforcement Learning is learning the best actions on the basis of rewards and punishment. But when we put on our technical goggles, Reinforcement Learning is defined using three basic concepts: states, actions, and rewards.

Here the “**state**” defines a situation in which an agent is present; the agent performs some “**actions**”, and based on these actions it receives either rewards or punishment.

Consider the example of a dog: we have the owner and the “**dog**” (the **agent**) itself. When the owner is in the garden with the dog and throws a ball, this throwing of the ball is the “**state**” for the **agent**, and the **dog** running after the ball is the “**action**”.

The result will be appreciation or food for the dog from the owner, which is the “**reward**” for the action; if the dog does not go after the ball and takes some alternative action instead, it may get some “**punishment**”. This is what Reinforcement Learning is all about. Next, we’ll understand the terminology that Reinforcement Learning comprises.

For every Reinforcement Learning problem, there are predefined components which help in better representation and understanding of the problem. They are the following:

**Agent**: The agent takes actions; as mentioned earlier, in our example the dog is the **agent**.

**Action (A)**: The agent has a set of actions **A** from which it selects which action to perform, just like the dog deciding whether to go after the ball, just look at the ball, or jump in place.

**Discount Factor:** The **discount factor** is multiplied with future rewards to reduce their effect on the agent’s choice of action. Put simply, the discount factor makes future rewards less valuable than immediate rewards, which makes the agent focus on short-term goals. The lower the discount factor, the more insignificant future rewards become, and vice versa.

**Environment: **It is the surroundings of the agent in which it moves. In the dog example, **the environment **consists of the owner and the garden in which the dog is present. It is the **environment **which gives the agent its rewards as an output based upon the agent’s current state and action as inputs.

**State:** A state is the immediate situation in which the agent finds itself in relation to other important things in the surroundings, such as tools, obstacles, enemies and prizes/rewards. In the dog example, the state is the moment the ball is thrown.

**Reward (R):** The **reward** is the output the agent receives in response to its actions. For example, the dog receives **dog food** as a **reward** if it brings back the ball, and scolding as a **punishment** if it refuses to do so.

**Policy:** The policy is the strategy the agent uses to determine which actions to take on the basis of the current state. Basically, the policy maps states to actions, i.e. it selects the actions that provide the maximum rewards in given states. In the dog example, once the dog knows that dog food will be given as a reward for bringing back the ball, it will form its own policy to reap maximum rewards.
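The discount factor above can be illustrated with a quick computation (the reward sequence is made up):

```python
# With discount factor gamma, a reward received t steps in the future is
# worth gamma**t today.
def discounted_return(rewards, gamma):
    return sum(r * gamma ** t for t, r in enumerate(rewards))

print(discounted_return([1, 1, 1], 0.9))  # 1 + 0.9 + 0.81, roughly 2.71
print(discounted_return([1, 1, 1], 0.1))  # future rewards become nearly worthless
```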

**Markov Decision Processes (MDPs)** are mathematical frameworks for describing an environment in reinforcement learning, and almost all RL problems can be formalized using MDPs.

Basically, **MDPs** consist of a set of finite environment states S, a set of possible actions A(s) in each state, a real-valued reward function R(s), and a transition model.

All those who possess some basic knowledge of Artificial Intelligence would be well aware of the terms Supervised and Unsupervised learning. Similarly, Reinforcement learning has been the buzzword in the field of AI and its implementations have gained huge popularity. **For example**, The very famous **AlphaGo** was developed using Reinforcement learning by **Google Deepmind**, which went on to defeat the World Champion “**Lee Sedol”** of the Game **Go**.

Now you must be wondering why supervised learning or unsupervised learning was not used. So let’s look at the areas where Reinforcement learning is better as compared to the other two methods.

Supervised Learning gets its name from the usage of an external supervisor who is aware of the environment and shares that knowledge with the agent for accomplishing the task. In general, supervised learning is like learning from tasks which have already been completed, with the agent obtaining experience from them. But in some cases there are no such tasks from which experience can be gained, and thus we cannot have any supervisor.

Since the game of Go has move counts in the billions, we cannot create a knowledge repository, so the only option left is to play more and more games to gain experience and extract knowledge from it.

In both supervised learning and reinforcement learning we map between input and output, but in reinforcement learning the reward function acts as the feedback or experience, in contrast to the labeled examples of supervised learning.

In unsupervised learning, there is no concept of mapping between input and output, unlike reinforcement learning; the main aim is to find hidden patterns. For example, most recommendation systems, such as movie or news-article recommendation, use unsupervised learning: a knowledge graph is built from the constant feedback customers provide by liking particular movies or articles, and similar items are then recommended.


Reinforcement learning, along with its fundamental concepts, needs to be implemented practically, and for that we use the following algorithms. Let’s have a look at them:

Q-learning is the most used reinforcement learning algorithm. Using this algorithm, the agent learns the quality (**Q value**) of each action in each state, based on how much reward the environment returns.

Q-learning uses a table (the Q-table) to store a Q value for every state–action pair of the environment.

SARSA closely resembles Q-learning. The only difference between the two is that SARSA learns the Q value from the action actually performed by the current policy, whereas Q-learning uses the greedy policy in its update.
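The two update rules can be written down in a few lines. The snippet below is an illustrative sketch, not code from this article; the state names, Q-table values, rewards, learning rate `alpha` and discount factor `gamma` are made-up toy numbers chosen only to show how the updates differ.

```python
# Illustrative sketch of the Q-learning vs. SARSA updates;
# the toy Q-table, states and rewards below are made up.
alpha, gamma = 0.5, 0.9     # learning rate and discount factor

# Q[state][action]: the table Q-learning maintains
Q = {"s1": {"left": 1.0, "right": 3.0},
     "s2": {"left": 2.0, "right": 5.0}}

def q_learning_update(s, a, r, s_next):
    # off-policy: bootstrap from the BEST action available in the next state
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(s, a, r, s_next, a_next):
    # on-policy: bootstrap from the action the current policy actually took
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

q_learning_update("s1", "left", r=1.0, s_next="s2")
print(Q["s1"]["left"])    # 1.0 + 0.5 * (1 + 0.9*5.0 - 1.0) = 3.25

sarsa_update("s1", "right", r=1.0, s_next="s2", a_next="left")
print(Q["s1"]["right"])   # 3.0 + 0.5 * (1 + 0.9*2.0 - 3.0) = 2.9
```

The only line that differs is the bootstrap target: Q-learning uses `max` over the next state's actions, SARSA uses the action the policy actually chose.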

Now we will have a look at a basic implementation of reinforcement learning using the **OpenAI Gym library**. Gym makes it possible to develop and compare different reinforcement learning algorithms.

Gym provides us with a variety of test problems, i.e. **environments**, all of which can be used to learn more about our reinforcement learning algorithms.

Before starting to work with Gym, we need to install gym using pip:

pip install gym

Or run the install directly from an IPython notebook:

!pip install gym

After this, we are ready to start.

**Step 1:**

First, we import the gym library installed earlier. Then we use one of Gym's built-in environments, **CartPole**. This environment displays a pole trying to balance on a cart that moves left and right.

**Step 2:**

Here we create a function that decides the action the agent should take, based on the current state of the environment. The aim is to keep the pole on the cart balanced so that it does not fall; if the pole leans past a given angle, the function returns 0, and 1 otherwise.

In the totals list, we store the rewards collected over each episode.

In this loop we accumulate the rewards obtained in each episode. Each episode starts with an observation obtained by resetting the environment with the reset() function.

Here we run an instance of the CartPole environment for up to 1000 timesteps. A small pop-up window displays the cart and pole via the render() function. At each step we decide the current action, based on the observation produced by the agent's previous actions, by calling the basic_policy() function.

Next we use the step() function, which returns four values: **observation** (object type: the observation of the environment), **reward** (float type: the amount of reward received for the previous action), **done** (Boolean type: whether the episode has terminated), and **info** (dictionary type: information used for debugging and for learning about the environment). So in this loop, at each timestep the **agent chooses an action** and the **environment** returns an **observation** and a **reward**.

Finally, once the done variable becomes true, we break out of the loop and append the episode_rewards value to totals.

We then print the totals list along with the maximum reward obtained. Most importantly, we call the close() function to close the pop-up window; otherwise the program may crash.
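The loop described in the steps above can be reconstructed roughly as follows. This is a hedged sketch, not the article's original code: to keep it runnable without Gym installed, a hypothetical `FakeCartPole` stand-in mimics the classic Gym interface (`reset()`, `step()` returning four values, `close()`). With the real library you would instead create the environment with `gym.make('CartPole-v1')` and call `env.render()` inside the loop.

```python
import random

class FakeCartPole:
    """Minimal stand-in for Gym's CartPole (hypothetical, for illustration only)."""
    def reset(self):
        # observation: [cart position, cart velocity, pole angle, pole angular velocity]
        self.angle = random.uniform(-0.05, 0.05)
        return [0.0, 0.0, self.angle, 0.0]
    def step(self, action):
        # simplified dynamics: in this stand-in, action 1 nudges the angle down,
        # action 0 nudges it up
        self.angle += -0.03 if action == 1 else 0.03
        obs = [0.0, 0.0, self.angle, 0.0]
        reward = 1.0                        # +1 for every step the pole stays up
        done = abs(self.angle) > 0.2        # episode ends if the pole leans too far
        return obs, reward, done, {}
    def close(self):
        pass

def basic_policy(obs):
    # choose the action that, under this stand-in's dynamics,
    # pushes the pole back toward upright
    angle = obs[2]
    return 0 if angle < 0 else 1

env = FakeCartPole()                        # with Gym: gym.make('CartPole-v1')
totals = []                                 # total reward collected in each episode
for episode in range(20):
    episode_rewards = 0
    obs = env.reset()
    for step in range(1000):                # run for at most 1000 timesteps
        action = basic_policy(obs)
        obs, reward, done, info = env.step(action)
        episode_rewards += reward
        if done:
            break
    totals.append(episode_rewards)
env.close()
print(max(totals))                          # 1000.0 under this stand-in's dynamics
```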

The code of this implementation can be found here

To implement reinforcement learning we need a predefined method of learning, i.e. a way for the agent to understand which actions to take to maximize its rewards.

Two such learning methods are commonly used:

In this method, the agent completes the episode (i.e. reaches a “terminal state”) and then looks at the **total rewards to see how well it has performed**. Here in the Monte Carlo method, the **rewards collection is done at the end of the episode** and then on the basis of the result, **the maximum expected future reward is calculated**.

**For example:** in the dog example, the agent (the dog) using the **Monte Carlo approach** completes the action of bringing the thrown ball back and then analyses the reward it received. On the basis of that reward, the dog decides which actions to perform in the near future to maximize the reward.

In the TD learning method, the rewards obtained are **analysed after each step**, and on that basis the **maximum expected future reward** is calculated. Therefore, after each step the agent decides which action should be taken to obtain the maximum reward.

**For example:** again using the dog example, here the dog looks for appreciation after each step: if it starts running after the ball and sees the owner approving, it already anticipates a reward. Similarly, if the dog sits and does not go after the ball, the owner's scolding helps it understand and change its action.
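The contrast between the two methods can be shown with toy value-function updates. This sketch is illustrative only, not from the article; the states, rewards and step size are made up, and `gamma`/`alpha` denote the usual discount factor and learning rate.

```python
gamma, alpha = 1.0, 0.1   # discount factor and learning rate (toy values)

def mc_update(V, state, episode_return):
    # Monte Carlo: wait until the episode ENDS, then move V(state)
    # toward the total return actually observed from that state
    V[state] += alpha * (episode_return - V[state])

def td_update(V, state, reward, next_state):
    # TD(0): after a SINGLE step, move V(state) toward r + gamma * V(next_state)
    V[state] += alpha * (reward + gamma * V[next_state] - V[state])

V = {"s": 0.0, "s_next": 2.0}
mc_update(V, "s", episode_return=10.0)           # V(s) -> 0 + 0.1*(10 - 0) = 1.0
td_update(V, "s", reward=1.0, next_state="s_next")
print(V["s"])                                    # 1.0 + 0.1*(1 + 2.0 - 1.0) = 1.2
```

The MC update needs the whole episode's return before it can learn; the TD update learns from each individual step.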

In any reinforcement learning problem the agent tries to build an optimal policy, but in doing so it faces a dilemma between exploring new states and maximizing its reward. This dilemma is known as the **Exploration vs. Exploitation Trade-off**.

To be precise, **Exploration** is finding more information about the environment and discovering new actions which can be taken to get more rewards, whereas **Exploitation** is exploiting known information to maximize the rewards.

**In our dog example**, suppose the owner does not scold the dog for failing to bring the ball back, and the dog is very lazy. Whenever the owner throws the ball, the dog stays put, since it is already getting a reward in the form of rest; it keeps on resting, which is analogous to **Exploitation**. But if the dog tries to bring the ball back and discovers that it receives food as a reward, that is **Exploration**, since the dog explored new actions to obtain new rewards.

This drawback arises because in most cases the agent **memorizes** one path and never tries to explore any others. We want the **agent not only to keep exploiting known paths but also to keep searching for new ones**; the balance is set by a **hyper-parameter** that specifies how much exploration and how much exploitation is needed.
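A common concrete form of that hyper-parameter is epsilon in an epsilon-greedy policy. The sketch below is one standard technique, not something specified in this article, and the Q values are made up for illustration.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon, explore (random action);
    otherwise exploit (best-known action)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                     # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])    # exploit

random.seed(0)
q = [0.2, 0.9, 0.5]   # toy Q values: action 1 looks best
actions = [epsilon_greedy(q, epsilon=0.1) for _ in range(1000)]
print(actions.count(1) / len(actions))   # mostly action 1, with occasional exploration
```

A larger epsilon means more exploration; decaying epsilon over time is a common way to shift from exploration toward exploitation.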

We have already discussed that reinforcement learning is the best possible option when information about a particular task or environment is limited. Let's look at some such applications:

**Playing Games like Go/Chess**
AlphaGo, as mentioned earlier, is a computer program that registered a victory against one of the best players in the world. AlphaGo used RL to decide which move to take based on the current inputs and actions.

**Robot Control**
Robots have learned to walk, run, dance, fly, play various sports and perform mundane tasks using RL.

**Online Advertising**
Reinforcement learning has changed how advertisements are broadcast: users now see ads at the right time, according to their history and interests.

**Dialogue Generation**
A conversational agent chooses its next sentence by looking ahead at the long-term reward, making both speakers more involved in the conversation.

**Education and Training**
Numerous online educational platforms are looking to incorporate RL into their tutoring systems and personalized learning. With RL, students have the advantage of receiving study material suited to their learning capability.

**Health and Medicine**
RL produces results by looking at similar problems and how they were dealt with; on that basis it suggests optimal treatment policies for patients.

**Finance**
RL has been used for various financial tasks, such as stock prediction based on the past and present performance of stocks. Many companies are trying to bring reinforcement learning into their operations, for example with systems for trade execution.

**Real-world Implementations of Reinforcement Learning**
To get a deeper insight into how reinforcement learning is implemented, have a look at the following links:

- Reinforcement learning for Stock Prediction
- Reinforcement learning for meal-planning
- Reinforcement learning for Sports Betting

The post Beginner’s Guide to Reinforcement Learning appeared first on StepUp Analytics.


So what is required for creating such machine learning systems? The following components are needed:

- **Data –** Input data is required for predicting the output.
- **Algorithms –** Machine learning depends on certain statistical algorithms to determine data patterns.
- **Automation –** The ability to make systems operate automatically.
- **Iteration –** The complete process is iterative, i.e. the process is repeated.
- **Scalability –** The capacity of the machine can be increased or decreased in size and scale.
- **Modeling –** Models are created according to demand through the process of modeling.

Machine Learning methods are classified into certain categories. These are:

**Supervised Learning –**In this method, input and output are provided to the computer along with feedback during the training. The accuracy of predictions by the computer during training is also analyzed. The main goal of this training is to make computers learn how to map input to the output.

**Unsupervised Learning –**In this case, no such training is provided, leaving the computer to find the output on its own. Unsupervised learning is mostly applied to transactional data and is used for more complex tasks. It uses another iterative approach, known as deep learning, to arrive at conclusions.

**Reinforcement Learning –**This type of learning uses three components: an agent, an environment, and actions. The agent is the one that perceives its surroundings, the environment is what the agent interacts with, and actions are what the agent performs in that environment. The main goal in reinforcement learning is to find the best possible policy.

Machine learning makes use of processes similar to those of data mining. Machine learning algorithms are described in terms of a target function (f) that maps an input variable (x) to an output variable (y). This can be represented as:

**y=f(x)**

There is also an error term e, which is independent of the input variable x. Thus the more general form of the equation is:

**y=f(x) + e**

In machine learning, the mapping from x to y is learned in order to make predictions. This approach is known as predictive modeling, and its goal is to make the most accurate predictions possible. Various assumptions can be made about the form of this function.
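As a toy illustration of y = f(x) + e, the sketch below generates data from a known line plus random noise and recovers an approximation of f by ordinary least squares. The true slope and intercept (2 and 3) and the noise level are made up for the example.

```python
import random

# Hypothetical data: y = 2x + 3 plus Gaussian noise e
random.seed(42)
xs = [x / 10 for x in range(100)]
ys = [2 * x + 3 + random.gauss(0, 0.5) for x in xs]

# Ordinary least squares estimate of the unknown f(x) = slope*x + intercept
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

print(round(slope, 2), round(intercept, 2))   # close to the true values 2 and 3
```

The fitted slope and intercept differ from the true ones only because of the error term e; with more data the estimates tighten.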

Machine learning now underpins a great deal of modern software. Here are some of its benefits:

- **Faster decision making –** Machine learning provides the best possible outcomes by prioritizing routine decision-making processes.
- **Adaptability –** Machine learning provides the ability to adapt to a rapidly changing environment, which changes quickly because data is constantly being updated.
- **Innovation –** Machine learning uses advanced algorithms that improve overall decision-making capacity, which helps in developing innovative business services and models.
- **Insight –** Machine learning helps in understanding unique data patterns, on the basis of which specific actions can be taken.
- **Business growth –** With machine learning, the overall business process and workflow become faster, contributing to overall business growth and acceleration.
- **Better outcomes –** With machine learning, the quality of the outcome improves, with a smaller chance of error.

The post Supervised vs Unsupervised Machine Learning appeared first on StepUp Analytics.


Machine Learning is the science of getting computers to learn and act like humans do, and improve their learning over time in an autonomous fashion, by feeding them data and information in the form of observations. Machine learning is used in many domains, right from predicting if the next movie will be a success at the box office or not to the nuances of the stock market such as predicting the stock price.

Coming to the actuarial profession, machine learning has applications in pricing, reserving, product design and capital modeling, to name a few. It has been introduced in CS2 as well, covering applications of concepts such as time series, Lee-Carter and p-spline regression models using R.

Machine learning can be broadly classified into Supervised, Unsupervised and Reinforcement learning. Curriculum 2019 mainly focuses on Supervised and Unsupervised Machine learning.

Let’s have a look at what it is:

Supervised learning, as the name suggests, indicates the presence of a supervisor acting as a teacher. It is learning in which we teach or train the algorithm using data that is well labelled, i.e. data already tagged with the correct answer (the training data). The algorithm has a target/outcome variable (dependent variable) to be predicted from a given set of predictors (independent variables). Using these variables, we generate a function that maps inputs to desired outputs. The algorithm is then provided with a new set of data, and it produces an outcome based on what it learned from the labelled data.

Supervised learning has been broadly classified into regression and classification problems. Both problems have the goal of construction of a good model that can predict the value of the dependent variable from the independent variables. The difference between the two tasks is the fact that the dependent variable is numerical for regression and categorical for classification.

**Regression:**A regression problem is when the output variable is a real or continuous value, such as salary or weight. Regression predictive modeling is the task of approximating a mapping function (f) from input variables (X) to a continuous output variable (Y). For example, predicting the annual expenditure (dependent variable) of a person by using his annual income as the independent variable.

**Classification:**A classification problem is when the output variable is a category, such as yes or no, black or white. A classification model attempts to draw some conclusions from observed values. Given one or more inputs a classification model will try to predict the value of one or more outcomes. Classification models include Logistic regression, Decision tree, Random forest, Naive Bayes, to name a few. For example, predicting whether a person will default on his next loan payment on the basis of his income.

Many actuarial modeling projects such as insurance contract pricing, pension scheme valuation fall into the category of supervised learning.

Unlike supervised learning, no teacher is provided which means no training will be given to the machine. The information is neither classified nor labeled and the task of the machine is to group unsorted information according to similarities, patterns, and differences without any prior training of data.

In this algorithm, we do not have any target or outcome variable to predict. It is used for clustering population in different groups, which is widely used for segmenting variables under study in different groups. Unsupervised machine learning can be classified into two categories of algorithms:

**Clustering:**It is the task of grouping a set of objects in such a way that objects in the same group (a cluster) are more similar to each other than to those in other groups (clusters). A clustering problem is when you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior. Some examples are:

- given a class of buyers, cluster based on the buyer attributes
- given a set of tweets, cluster based on the content of the tweet

**Association:**An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y. For example, the rule found in the sales data of a supermarket would indicate that if a customer buys bread and butter together, they are likely to buy tomato ketchup. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placements.
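The bread-and-butter rule can be made concrete by computing its support and confidence. The five transactions below are invented purely for illustration; the metric definitions are the standard ones from association rule learning.

```python
# Toy illustration of support and confidence for the rule
# {bread, butter} -> {ketchup}; the transactions are made up.
transactions = [
    {"bread", "butter", "ketchup"},
    {"bread", "butter", "ketchup", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "ketchup"},
]

antecedent, consequent = {"bread", "butter"}, {"ketchup"}
n_antecedent = sum(antecedent <= t for t in transactions)
n_both = sum((antecedent | consequent) <= t for t in transactions)

support = n_both / len(transactions)   # how often the full rule occurs
confidence = n_both / n_antecedent     # P(ketchup | bread and butter bought)
print(support, confidence)             # support 0.4, confidence 2/3
```

A retailer would act on rules whose support and confidence exceed chosen thresholds, e.g. by placing ketchup near the bread aisle.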

Though Machine Learning is a vast topic, identifying the correct technique according to your problem and the data and going ahead in a systematic way is the key.

By now you would have got a brief idea of what Machine Learning is. Here is a list of few resources which can help you in understanding the concept in detail from an actuarial perspective.

- https://www.actuaries.org.uk/documents/practical-application-machine-learning-within-actuarial-work
- https://www.actuaries.org.uk/documents/modelling-analytics-and-insights-data-maid-working-party-terms-reference

The post Introduction To Machine Learning appeared first on StepUp Analytics.


- Introduction
- How does Support Vector Machine work (SVM) ?
- Kernel trick
- Implementing SVM in Python
- Advantages and Disadvantages
- Applications

Support Vector Machine (SVM) is a popular supervised machine learning algorithm used for both classification and regression, though it is mostly used for classification tasks. An SVM model is a representation of data points in space such that the points can be grouped into different categories, separated by a clear gap that is as wide as possible. It looks at the extremes of the datasets, a.k.a. the support vectors (marked in the following figure), and draws a boundary known as a hyper-plane.

When data is unlabelled, supervised learning is not possible. As a result, an unsupervised learning approach is used which attempts to find the natural clustering of data to form groups. The support vector clustering applies the statistics of support vectors to categorize the unlabelled data. This is one of the most widely used algorithms in industrial applications.

Let’s say you have some sample points in 2D space. Now you want to classify the stars and the circles with a hyperplane.

Basically, you can classify them perfectly using any of these 3 planes namely 1, 2, 3. But is there any systematic way to choose the right plane among them? The answer is YES!

The thumb rule is you need to identify the hyperplane that has the maximum distance between the nearest data points of either class. This distance is called Margin. Even if you draw any other plane parallel to hyper-plane 2 on either of its sides, its margin would be less as compared to that of hyper-plane 2. So the hyper-plane 2 is the correct choice.
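The distance being maximized can be computed directly. The sketch below uses a hypothetical 2-D hyperplane w·x + b = 0 with made-up coefficients, and evaluates the point-to-plane distance |w·x + b| / ||w|| from which the margin is built:

```python
import math

# Hypothetical hyperplane w . x + b = 0 with w = (3, 4), b = -12
w, b = (3.0, 4.0), -12.0

def distance_to_hyperplane(x):
    # |w . x + b| / ||w||: the geometric distance the margin maximizes
    return abs(w[0] * x[0] + w[1] * x[1] + b) / math.hypot(w[0], w[1])

print(distance_to_hyperplane((0.0, 0.0)))   # 12 / 5 = 2.4
print(distance_to_hyperplane((4.0, 0.0)))   # 0.0 -- this point lies on the plane
```

The support vectors are the training points with the smallest such distance; SVM chooses w and b so that this smallest distance is as large as possible.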

**Note: **

- SVM selects a hyperplane in such a way that it classifies the objects accurately prior to maximizing margin. Hyper-plane 2 classifies all objects accurately whereas hyper-plane 1 has classification error. Hence hyper-plane 2 is the right plane.
- SVM is robust to outliers. It ignores the outliers and finds a plane with maximum margin.

What we have seen so far is the linear support vector machine, where the clusters can be separated linearly. But what if the data set is non-linear and you cannot separate it into different clusters using a hyperplane? Suppose you have a dataset like this: it looks impossible to separate it into two clusters using a hyper-plane, keeping the computational cost in mind.

Here you can use a kernel function to convert the above data points into a higher-dimensional space. You can simply apply a polynomial function, converting the data to a parabola on which the points can easily be separated using a single hyperplane, as shown in the following figure.

Hence you can convert the 1D data points to 2D data points and also 2D data points to 3D data points. But the computational cost is high.

Kernel trick is like a magic wand which will boil down a complex non-separable data points into a simpler form at the same time it can minimize the computational cost. It takes input vectors in original space and returns the dot product of the vectors in feature space.

You can apply the dot product between two vectors, so that every point is mapped into a higher-dimensional space by some transformation. It is a technique in machine learning for avoiding intensive computation in some algorithms, which can take a computation from infeasible to feasible.

In the following input space, the red and the blue data points have been separated using a complex computational boundary. To minimize this computational cost, it has been transformed into a higher dimensional feature space (2D to 3D) where data points could be easily be separated into different clusters using a hyperplane.

Some popular kernel functions are:

- Polynomial kernel
- Gaussian Radial basis function (RBF) kernel
- Gaussian kernel
- Laplace RBF kernel
- Sigmoid kernel etc.

No matter which kernel you use, it is important to tune its parameters.
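The "dot product in feature space without ever computing the mapping" idea can be verified numerically. The sketch below is illustrative (not from the article): it checks that the degree-2 polynomial kernel K(x, y) = (x·y)² agrees with the dot product under an explicit feature map φ.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    # explicit map of a 2-D point into 3-D feature space
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def poly_kernel(x, y):
    # the kernel trick: same value as dot(phi(x), phi(y)),
    # but computed entirely in the original 2-D space
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, 4.0)
print(poly_kernel(x, y), dot(phi(x), phi(y)))   # both equal 121.0
```

The kernel evaluates one dot product and a square; the explicit route would need the full feature map, which for higher-degree kernels grows rapidly in dimension.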

```python
# display the dataset URL inline in the notebook
from IPython.display import HTML
HTML('<iframe src="https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"></iframe>')
```

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# import dataset
from sklearn.datasets import load_iris
iris = load_iris()

print(iris.target)
print(iris.target_names)
```

```
### Output ###
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
['setosa' 'versicolor' 'virginica']
```

Store the data and target in different objects:

```python
X = iris.data
y = iris.target
```

Partition into test and train data

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)
```

```python
from sklearn.svm import SVC
model = SVC()
model.fit(X_train, y_train)
```

```
### Output ###
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
```

```python
# predict
result = model.predict(X_test)
print(result)
```

```
### Output ###
[2 2 1 1 2 1 0 0 0 2 1 0 2 2 0 2 2 1 2 0 1 0 0 2 1 2 0 0 2 1 0 1 2 2 0 2 2
 2 1 1 1 1 1 1 0 2 1 0 1 2 1 1 2 0 2 0 0 0 0 0]
```

```python
# classification report and confusion matrix
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, result), '\n', classification_report(y_test, result))
```

```
### Output ###
[[20  0  0]
 [ 0 18  2]
 [ 0  1 19]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        20
           1       0.95      0.90      0.92        20
           2       0.90      0.95      0.93        20

 avg / total       0.95      0.95      0.95        60
```

Finding the best parameter values using a grid search:

```python
# GridSearchCV now lives in sklearn.model_selection
# (the old sklearn.grid_search module has been removed)
from sklearn.model_selection import GridSearchCV

# finding the best combination of C and gamma
parameter_grid = {'C': [0.1, 1, 10, 100, 1000],
                  'gamma': [1, 0.1, 0.01, 0.001, 0.0001]}
grid = GridSearchCV(SVC(), parameter_grid, verbose=3)
grid.fit(X_train, y_train)
```

**Output**

```
Fitting 3 folds for each of 25 candidates, totalling 75 fits
[CV] C=0.1, gamma=1 ..................................................
[CV] ......................... C=0.1, gamma=1, score=1.000000 -   0.0s
[CV] C=0.1, gamma=1 ..................................................
...
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:    0.2s finished
GridSearchCV(cv=None, error_score='raise',
    estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
        decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
        max_iter=-1, probability=False, random_state=None, shrinking=True,
        tol=0.001, verbose=False),
    fit_params={}, iid=True, n_jobs=1,
    param_grid={'C': [0.1, 1, 10, 100, 1000],
                'gamma': [1, 0.1, 0.01, 0.001, 0.0001]},
    pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=3)
```

The C parameter controls the cost of misclassification: a large C value gives low bias and high variance.

```python
grid.best_params_
```

```
{'C': 1, 'gamma': 1}
```

```python
grid.best_estimator_
```

```
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
```

```python
grid_predictions = grid.predict(X_test)
print(confusion_matrix(y_test, grid_predictions))
print('\n')
print(classification_report(y_test, grid_predictions))
```

```
[[20  0  0]
 [ 0 18  2]
 [ 0  2 18]]

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        20
           1       0.90      0.90      0.90        20
           2       0.90      0.90      0.90        20

 avg / total       0.93      0.93      0.93        60
```

**Advantages: **

- It is a robust model to solve prediction problems.
- It works effectively even if the number of features is greater than the number of samples.
- Non-Linear data can also be classified using customized hyper-planes built by using kernel trick.

**Disadvantages:**

- Choosing the right kernel is difficult sometimes.
- SVM can be extremely slow in the test phase.
- High algorithm complexity and huge memory requirement due to quadratic programming.
- when (number of samples) > (number of features), it gives poor results.

- **Face detection:** SVM can classify an image as face or non-face.
- **Bioinformatics:** protein classification and cancer classification.
- **Text and hypertext classification:** classifies natural-text or hypertext documents based on their content, e.g. email filtering.
- **Handwriting detection:** SVM is used to identify handwritten characters for data entry and for validating signatures on documents.
- **Image classification**

For further studies, latest updates or interview tips on data science and machine learning, subscribe to our emails.

The post Support Vector Machine appeared first on StepUp Analytics.

The post Difference between K Means Clustering and Hierarchical Clustering appeared first on StepUp Analytics.

For example, take the case of business intelligence. In business intelligence, clustering can be used to group customers so that the similarity within a group is greater than the similarity between groups, which facilitates the development of business strategies for enhancing customer relationship management. For grouping, we can choose among several families of methods: partitioning methods, hierarchical methods, density-based methods and grid-based methods.

In partitioning methods, we find mutually exclusive clusters of spherical shape based on distance. We can use the mean or the median as the cluster centre to represent each cluster. This approach works well for small and medium-sized data sets.

In hierarchical methods, we create a hierarchical decomposition of the given set of data, built in one of two ways: from the bottom up or from the top down. On the basis of how the decomposition is created, the method is divided into two approaches: the agglomerative approach and the divisive approach.

The main problem with this process is that once a step is done it can never be undone. In density-based methods, we find arbitrarily shaped clusters, which are dense regions of objects in space separated by low-density regions. In this method each point must belong to a densely populated neighbourhood, which makes outliers easy to detect. Grid-based methods quantize the object space into a finite number of cells that form a grid structure, without depending on the number of data objects.

So far we have discussed what clustering is, why it is so important in data analysis, and the different clustering methods. Our target now is to discuss the first two methods in detail, along with the difference between them.

Before discussing partitioning methods, let me tell you a funny story. Suppose you and your ten friends plan that, once your parents fall asleep, all of you will go to the playground to play cricket. After lunch your parents doze off, but suddenly it starts raining; thirty minutes later the rain stops, and as you all run along the road your footprints form clusters on it.

Suddenly your mother wakes up, sees that you are not in the house, calls the other parents, and all of them come out and reach the playground by following the clusters on the road. The same thing happens in data analysis: if you find the clusters in the data, you catch the pattern of the data. Now let's discuss the first two methods in detail.

In Partitioning Method, we organize the objects of a set into several exclusive groups of clusters. Formally, given a data set D of n objects and k the number of clusters to form, a partitioning algorithm organizes the objects into k partitions where k ≤ n where each partition represents a cluster.

Suppose a data set D contains n objects in Euclidean space. Using a partitioning method we distribute the **n** data points among **k** non-overlapping clusters c[1], c[2], …, c[k], where c[i] ∩ c[j] = ∅ for i ≠ j. Our objective is then to find a representative for each of the k clusters. Two kinds of representative are common: the centroid, which is the centre of the cluster, and the medoid, the most centrally located actual point of the cluster, chosen so that the variation within the cluster is minimal.

If we use the mean as the representative of each cluster, the clustering method is called K-means clustering, and if we use the medoid it is called K-medoids clustering. Mathematically, let p[i] be the centre of the i-th cluster c[i], let c[i,j] be the j-th point of that cluster, and suppose c[i] contains m_i points. We choose p[i] so that the variation within the cluster is minimal, measured either by the sum of squared errors

SSE = Σ_{j=1..m_i} dist(p[i], c[i,j])²,

which is minimized when p[i] = mean(c[i]), or by the sum of absolute errors

SAE = Σ_{j=1..m_i} |dist(p[i], c[i,j])|,

which is minimized when p[i] = median(c[i]).

To perform K-means clustering we follow this algorithm:

**Input:**

**k:** The number of clusters

**D:** A data set containing n objects

**Output:** A set of k clusters

Method:

- Arbitrarily choose k objects from D as the initial cluster centres;
- Repeat;
- Reassign each object to the cluster to which it is most similar, based on the mean of the objects in the cluster;
- Update the cluster means, i.e. recalculate the mean value of the objects in each cluster;
- Until no change.

For example, suppose the data points are

D = {2,4,10,12,3,20,30,11,25}

k=2

Let p[1]=2, p[2]=4 be the initial centres.

Assign each point to its nearest centre:

C[1]={2,3}, C[2]={4,10,12,20,30,11,25}

Update the centres to the cluster means:

p[1]=2.5, p[2]=16

Reassign: C[1]={2,3,4}, C[2]={10,12,20,30,11,25}

Update: p[1]=3, p[2]=18

Reassign: C[1]={2,3,4,10}, C[2]={11,12,20,30,25}

Update: p[1]=4.75, p[2]=19.6

Reassign: C[1]={2,3,4,10,11,12}, C[2]={20,30,25}

Update: p[1]=7, p[2]=25

Reassigning with these centres leaves both clusters unchanged, so p[1] and p[2] stop moving and we end the iteration here.
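The iteration traced above can be reproduced with a few lines of code. The following is a minimal sketch of Lloyd's K-means algorithm on the same one-dimensional data; the function name and structure are my own, not from the article.

```python
def kmeans_1d(points, centres, max_iter=100):
    # Lloyd's algorithm on 1-D data, matching the worked example above
    for _ in range(max_iter):
        clusters = [[] for _ in centres]
        for p in points:   # assign each point to its nearest centre
            i = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
            clusters[i].append(p)
        new_centres = [sum(c) / len(c) for c in clusters]
        if new_centres == centres:   # stop when the centres no longer move
            return clusters, centres
        centres = new_centres
    return clusters, centres

D = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, centres = kmeans_1d(D, centres=[2, 4])
print(sorted(clusters[0]), sorted(clusters[1]), centres)
# [2, 3, 4, 10, 11, 12] [20, 25, 30] [7.0, 25.0]
```

Starting from the same initial centres (2 and 4), the code converges to the same final clusters and centres as the hand trace.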

Now we discuss Hierarchical clustering. While partitioning methods meet basic clustering requirements of organizing the set of objects into a number of exclusive groups, in some situations we may want to partition our data into groups at different levels such as in a hierarchy.

This clustering method works by grouping data objects into a hierarchy or "tree" of clusters. For example, a company organizing information about its people might divide the employees into three groups: executives, managers and staff. It might then divide the staff into three subgroups: senior officers, officers and trainees. This is the idea behind hierarchical clustering.

Now we concentrate on how Hierarchical clustering is done. There are two approaches: Agglomerative and Divisive. In the Agglomerative approach, we start with every point as an individual cluster and at each step merge the closest pair of clusters, working from the bottom up.

In the Divisive approach, we need to decide which cluster to split at each step and how to do the splitting. We start with one all-inclusive cluster and divide it step by step into sub-clusters until only singleton clusters of individual points remain; this works from the top down. Hierarchical Clustering is often displayed graphically using a tree-like diagram called a Dendrogram.

```r
wineurl <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"

# wine.data has no header row, so supply the column names explicitly
wine1 <- read.csv(wineurl, header = FALSE,
                  col.names = c("cultivar", "alcohol", "malicacid", "ash",
                                "alcalinity", "magnesium", "totalphenols",
                                "flavanoids", "nonflavanoidphenols",
                                "proanthocynin", "colorintensity", "hue",
                                "od280od315ofdilutedwines", "proline"))
head(wine1)
wine1

# drop the class label before clustering
wine2 <- wine1[, which(names(wine1) != "cultivar")]
wine2

winek3 <- kmeans(wine2, centers = 3)
winek3
```

```r
head(wine1)
```

```
  cultivar alcohol malicacid  ash alcalinity magnesium totalphenols flavanoids nonflavanoidphenols proanthocynin
1        1   13.20      1.78 2.14       11.2       100         2.65       2.76                0.26          1.28
2        1   13.16      2.36 2.67       18.6       101         2.80       3.24                0.30          2.81
3        1   14.37      1.95 2.50       16.8       113         3.85       3.49                0.24          2.18
4        1   13.24      2.59 2.87       21.0       118         2.80       2.69                0.39          1.82
5        1   14.20      1.76 2.45       15.2       112         3.27       3.39                0.34          1.97
6        1   14.39      1.87 2.45       14.6        96         2.50       2.52                0.30          1.98
  colorintensity  hue od280od315ofdilutedwines proline
1           4.38 1.05                     3.40    1050
2           5.68 1.03                     3.17    1185
3           7.80 0.86                     3.45    1480
4           4.32 1.04                     2.93     735
5           6.75 1.05                     2.85    1450
6           5.25 1.02                     3.58    1290
```

To check the entire data set in wine1:

```
wine1
  cultivar alcohol malicacid  ash alcalinity magnesium totalphenols flavanoids nonflavanoidphenols proanthocynin
1        1   13.20      1.78 2.14       11.2       100         2.65       2.76                0.26          1.28
2        1   13.16      2.36 2.67       18.6       101         2.80       3.24                0.30          2.81
... (remaining rows omitted)
```

**Filtering out the class label:**

```r
wine2 <- wine1[, which(names(wine1) != "cultivar")]
wine2
```

Running k-means clustering on the filtered data set:

```r
winek3 <- kmeans(wine2, centers = 3)
winek3
```

Output:

```
K-means clustering with 3 clusters of sizes 46, 69, 62

Cluster means:
   alcohol malicacid      ash alcalinity magnesium totalphenols flavanoids nonflavanoidphenols proanthocynin
1 13.79522  1.887174 2.426087   17.05435 105.04348     2.868696   3.013261           0.2854348      1.902174
2 12.51667  2.494203 2.288551   20.82319  92.34783     2.070725   1.758406           0.3901449      1.451884
3 12.92984  2.504032 2.408065   19.89032 103.59677     2.111129   1.584032           0.3883871      1.503387
  colorintensity       hue od280od315ofdilutedwines   proline
1       5.703913 1.0791304                 3.096522 1197.9783
2       4.086957 0.9411594                 2.490725  458.2319
3       5.650323 0.8839677                 2.365484  728.3387
```

**Clustering vector:**

```
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30
  1   1   1   3   1   1   1   1   1   1   1   1   1   1   1   1   1   1   3   3   3   1   1   3   3   1   1   3   1   1
 31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60
  1   1   1   1   3   3   1   1   3   3   1   1   3   3   1   1   1   1   1   1   1   1   1   1   1   1   1   1   2   3
 61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
  2   3   2   2   3   2   2   3   3   3   2   2   1   3   2   2   2   3   2   2   3   3   2   2   2   2   2   3   3   2
 91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
  2   2   2   2   3   3   2   3   2   3   2   2   2   3   2   2   2   2   3   2   2   3   2   2   2   2   2   2   2   3
121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150
  2   2   2   2   2   2   2   2   2   3   2   2   3   3   3   3   2   2   2   3   3   2   2   3   3   2   3   3   2   2
151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177
  2   2   3   3   3   2   3   3   3   2   3   2   3   3   2   3   3   3   3   2   2   3   3   3   3   3   2
```

```
Within cluster sum of squares by cluster:
[1] 1343168.5  443166.7  566572.5
 (between_SS / total_SS =  86.5 %)

Available components:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"
```

```r
plot(winek3$cluster)
```

```r
wine_H <- hclust(d = dist(wine2))
plot(wine_H)  # dendrogram
```

From the above two diagrams, we can see that the two methods group the data differently. Hierarchical clustering builds clusters within clusters and, unlike k-means, does not require the number of clusters to be specified in advance. Let us now discuss this difference in detail. The most important difference is the **hierarchy**; two different approaches fall under this name: top-down and bottom-up.

In top-down hierarchical clustering, we divide the data into 2 clusters (using k-means with k = 2, for example). Then, for each cluster, we can repeat this process, until all the clusters are too small or too similar for further clustering to make sense, or until we reach a preset number of clusters.
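One level of this top-down splitting can be sketched with `kmeans` itself (an illustrative sketch on the nine points from the earlier example; a full implementation would recurse on each half until a stopping rule is met):

```r
set.seed(1)  # kmeans picks random starting centres, so fix the seed
x <- c(2, 4, 10, 12, 3, 20, 30, 11, 25)

# First split: divide the whole data set into 2 clusters
top <- kmeans(x, centers = 2)$cluster
halves <- split(x, top)

# Second level: split each half again (if it has at least 2 points)
lapply(halves, function(h) {
  if (length(h) >= 2) split(h, kmeans(h, centers = 2)$cluster) else h
})
```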

In bottom-up hierarchical clustering, we start with each data item having its own cluster. We then look for the two items that are most similar and combine them in a larger cluster. We keep repeating until all the clusters we have left are too dissimilar to be gathered together, or until we reach a preset number of clusters.
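The bottom-up approach is what R's `hclust` implements. Applied to the same nine points used in the k-means example:

```r
D <- c(2, 4, 10, 12, 3, 20, 30, 11, 25)

# Agglomerative clustering: each point starts as its own cluster,
# and the closest pair of clusters is merged at each step
hc <- hclust(dist(D))   # complete linkage by default
plot(hc, labels = D)    # dendrogram of the merge history

# Cut the tree at a preset number of clusters, here 2
cutree(hc, k = 2)       # groups {2,3,4,10,11,12} and {20,25,30}
```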

In **k-means** clustering, we try to identify the best division of the data into k sets simultaneously. A common approach is to take k items from the data set as initial cluster representatives, assign every item to the cluster whose representative is closest, recalculate each cluster mean as the new representative, and repeat until the algorithm converges (all clusters stay the same).

The post Difference between K Means Clustering and Hierarchical Clustering appeared first on StepUp Analytics.
