The post Beginners Guide to Statistical Cluster Analysis in Detail appeared first on StepUp Analytics.

Here is the second part of the series, in which we cover non-hierarchical cluster analysis (non-HCA), continuing our exploratory data analysis techniques. Its key characteristics:

- Clusters do not form a hierarchy.
- No distance matrix calculation is required.
- Better suited for large datasets.

Non-HCA methods start in one of two ways:

- With an initial partition of items/objects into groups

**or**

- With an initial set of seed points, which will form the nuclei of the initial clusters.

**NOTE:** One way to start is to randomly select seed points from among the items, or to randomly partition the data (i.e. the items/objects) into initial groups.

**K-means method, or the method of iterative relocation:**

**K-means is an algorithm that assigns each object to the cluster having the nearest centroid/nucleus.**

**Algorithm:**

- Partition the items/objects into ‘k’ initial clusters.
- Reassign each item/object to the cluster whose centroid is nearest. Recalculate the centroids/nuclei for the cluster receiving the new item and for the cluster losing it.
- Repeat step 2 until no more reassignments are possible.

**NOTE:** Rather than starting with a partition of all items into k initial groups (as in **step 1**), we can also specify ‘k’ initial centroids/nuclei (seed points) and then proceed to **step 2** after a walk through the data.
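The relocation loop above can be sketched in code. The worked example in this article is handwritten (and later parts of the series use R); purely as an illustration, here is a minimal NumPy sketch, where the function name `kmeans` and the random choice of seed points are my own assumptions, not part of the original example:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Method of iterative relocation: assign each object to the
    nearest centroid, then recompute centroids, until stable."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen seed points (see the NOTE above).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every object to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recalculate the centroid of each cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 3: stop when no reassignment changes the centroids.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On two well-separated groups of points, a few passes of the loop are enough for the assignments to stop changing.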

The handwritten solved example is attached below. Go through the steps and try to understand them; if not, please shoot me an email @ irrfankhann29@gmail.com.

**Statistical non-hierarchical cluster analysis** (PDF file of the worked example)

**CLUSTER CRITERIA:** Comparing different partitions.

**Objective:** To have a criterion for an optimum partition of the data, so that for a given set of cases the clustering problem reduces to partitioning the data into ‘g’ clusters in a way that optimizes the clustering criterion.

- Let the ‘n’ data points (cases) be x(1), x(2), …, x(n).
- The sample variance–covariance matrix is given by
**∑ = (1/n)∑(x(i) − m)(x(i) − m)^t**, where **m = (1/n)∑x(i)** is the sample mean.

- Let there be ‘g’ clusters, with n(k) cases and mean vector m(k) in cluster k, and define:

The within-cluster sum of squares (SS) & cross-product matrix

**S(w) = ∑(k=1 to g) ∑(i∈cluster k) (x(i) − m(k))(x(i) − m(k))^t**

==> the pooled within-cluster scatter matrix over the ‘g’ clusters.

The between-cluster SS and cross-product matrix

**S(b) = ∑(k=1 to g) n(k)(m(k) − m)(m(k) − m)^t**

**Popular clustering criteria are based on univariate functions of S(b), S(w) or ∑.**
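The scatter matrices above satisfy the total-scatter decomposition S(w) + S(b) = n∑, which is what makes criteria based on them comparable across partitions. As a hedged illustration (the function and variable names here are my own, not from the article), a NumPy sketch that computes all three and lets you check the identity numerically:

```python
import numpy as np

def scatter_matrices(X, labels):
    """Compute the pooled within-cluster scatter S_w, the between-cluster
    scatter S_b, and the (biased) sample covariance Sigma = total scatter / n."""
    n, m = X.shape
    mean = X.mean(axis=0)                      # sample mean: m = (1/n) * sum x_i
    S_w = np.zeros((m, m))
    S_b = np.zeros((m, m))
    for g in np.unique(labels):
        Xg = X[labels == g]
        mg = Xg.mean(axis=0)                   # cluster mean m(k)
        d = Xg - mg
        S_w += d.T @ d                         # within-cluster SS & cross-products
        db = (mg - mean).reshape(-1, 1)
        S_b += len(Xg) * (db @ db.T)           # between-cluster SS & cross-products
    Sigma = (X - mean).T @ (X - mean) / n      # (1/n) sum (x_i - m)(x_i - m)^t
    return S_w, S_b, Sigma
```

For any partition of the rows of X, `S_w + S_b` should equal `n * Sigma` up to floating-point error.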

I will share the criteria in the next part of the cluster analysis series. Till then stay tuned, or practice your skills in cluster analysis; if you have any doubts, please ask me by shooting an email @ irrfankhann29@gmail.com.



The post Beginners Guide to Statistical Cluster Analysis in Detail Part-1 appeared first on StepUp Analytics.

Cluster Analysis can be done by two methods:

- Hierarchical cluster analysis.
- Non-Hierarchical cluster analysis.

**Hierarchical Cluster Analysis(HCA):**

- In HCA, the observation vectors (cases) are grouped together on the basis of their mutual distances.
- An HCA is usually visualised through a hierarchical tree called a dendrogram. This hierarchical tree is a nested set of partitions represented by a tree diagram.

- Sectioning the tree at a particular level produces a partition into **‘g’** disjoint groups.
- If 2 groups are chosen from different partitions, then either the groups are disjoint or one group is totally contained within the other.
- A numerical value is associated with each node of the tree where branches join together. This value is a measure of the distance or dissimilarity between the two merged clusters.
- Different distance measures give rise to different hierarchical cluster structures.

**There are two types of approaches for HCA: **

- Agglomerative HCA
- Divisive HCA

**Agglomerative HCA: **

- Operates by successive merges of cases.
- Begins with n clusters, each containing a single case.
- At each stage merges the 2 most similar groups to form a new cluster, thus reducing the number of clusters by one.
- Continues (as similarity decreases) until all subgroups are fused into one single cluster.

**Divisive HCA: **

- The divisive method operates by successive splitting of groups.
- It initially starts with a single group (i.e. one cluster containing all objects).
- The group is divided into 2 subgroups such that the objects in one subgroup are as far as possible from the objects in the other. Splitting continues until there are ‘n’ groups, each containing a single object.

**Note:** The results of both approaches are displayed through a dendrogram.

**Steps Involved in Agglomerative HCA: **

- Start with n clusters, each containing a single object, and an **n×n** symmetric matrix of distances (or similarities), **D = ((d[i,j]))**.
- Search the distance matrix **D** for the nearest (most similar) pair of clusters. Let the distance between the most similar clusters, say U & V, be denoted by **d[u,v]**.
- Merge clusters U & V into a new cluster (U, V) (producing an **(n−1)×(n−1)** matrix), and update the distance matrix by:
  - Deleting the rows & columns corresponding to clusters U & V.
  - Adding a row & a column giving the distances between the newly formed cluster (U, V) and the remaining clusters.
- Repeat the **second & third** steps a total of **(n−1) times**. Record the identity of the clusters that are merged and the level (distance or dissimilarity) at which they are merged.
- Construct the dendrogram from the information on mergers and merger levels.
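The steps above can be traced in code. The article's examples are worked by hand, so the following is only a minimal Python sketch of the agglomerative loop using the single-linkage update rule; the function name `single_linkage_hca` and the merge-history return format are my own choices:

```python
import numpy as np

def single_linkage_hca(D):
    """Agglomerative HCA on a symmetric distance matrix D (single linkage).
    Returns the merge history as (cluster_u, cluster_v, merge_level) tuples."""
    clusters = {i: (i,) for i in range(len(D))}  # start: one object per cluster
    D = D.astype(float).copy()
    active = list(clusters)
    merges = []
    while len(active) > 1:
        # Step 2: find the nearest (most similar) pair of active clusters.
        d_uv, u, v = min(((D[a, b], a, b)
                          for i, a in enumerate(active)
                          for b in active[i + 1:]), key=lambda t: t[0])
        merges.append((clusters[u], clusters[v], d_uv))
        # Step 3: merge V into U and update the distance matrix with the
        # single-linkage rule d[(U,V), W] = min(d[U,W], d[V,W]).
        for w in active:
            if w not in (u, v):
                D[u, w] = D[w, u] = min(D[u, w], D[v, w])
        clusters[u] = clusters[u] + clusters[v]
        active.remove(v)
    return merges
```

The recorded merge levels are exactly what the dendrogram plots on its vertical axis.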

**Possible distance measures between two clusters:**

- Single linkage: minimum distance, or the nearest-neighbour approach. The distance between clusters K1 and K2 is **min d(i,j)**, where i∈K1 and j∈K2.

For the worked example, the distance between cluster (1,2) and cluster (3,4,5) is

min[d{(1,2),(3,4,5)}] = min{d(1,3), d(1,4), d(1,5), d(2,3), d(2,4), d(2,5)}

**Under the single linkage approach, min[d{(1,2),(3,4,5)}] = d(2,5) in the attached example.**

Here is the single linkage example, attached as a PDF (New Doc 2017-12-19).

- Complete linkage: maximum distance between clusters, **max d(i,j)**, where i∈K1 and j∈K2.

For the same example, the candidate distances are d(1,3), d(1,4), d(1,5) | d(2,3), d(2,4), d(2,5), and the complete linkage distance between cluster (1,2) and cluster (3,4,5) is d(1,4).

Here is the complete linkage example, attached as a PDF (New Doc 2017-12-19 (1)).

- Average linkage: the average of all pairwise distances between the two clusters,

= (1/(n1·n2)) ∑ d(i,j), where i∈K1 and j∈K2; for the example above, with clusters of sizes 2 and 3, this is (1/6)∑d(i,j) over the 6 pairs.

- Centroid linkage: the distance between the centroids of the two clusters.
- Median linkage: the distance between the medians of the two clusters.
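As a quick illustration of how these linkage rules differ, here is a hedged NumPy sketch that computes the five distances between two clusters of points. The function name is mine, and note that I read ‘median linkage’ literally as the coordinate-wise median (R's hclust defines its median method somewhat differently):

```python
import numpy as np

def linkage_distances(K1, K2):
    """Distance between two clusters under the five linkage rules above.
    K1, K2 are arrays whose rows are points."""
    # All pairwise distances d(i, j), i in K1, j in K2.
    pair = np.linalg.norm(K1[:, None, :] - K2[None, :, :], axis=2)
    return {
        "single":   pair.min(),    # nearest neighbour
        "complete": pair.max(),    # farthest neighbour
        "average":  pair.mean(),   # (1/(n1*n2)) * sum of all d(i, j)
        "centroid": np.linalg.norm(K1.mean(axis=0) - K2.mean(axis=0)),
        "median":   np.linalg.norm(np.median(K1, axis=0) - np.median(K2, axis=0)),
    }
```

Running it on two small clusters makes the ordering single ≤ average ≤ complete easy to see.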

Hierarchical cluster analysis ends here; in the next tutorial article I will explain non-hierarchical cluster analysis.

Till then stay tuned, and keep visiting for learning tutorials you won’t get anywhere else.

If you have any doubts please mention in comments or shoot me an email @ irrfankhann29@gmail.com.



The post Hierarchical Clustering Techniques using R appeared first on StepUp Analytics.

We have already learned k-means clustering using R and the types of cluster analysis.

Each of the samples can be thought of as sitting in an m-dimensional space defined by the m variables (columns) in the data frame. We define similarity on the basis of the distance between two samples in this m-dimensional space.

Several different distance measures could be used, but the default is **Euclidean distance**, and this is used to work out the distance from every sample to every other sample. For the other options, check

`?dist`

This quantitative dissimilarity structure of the data is stored in a matrix produced by the “**dist** function”.
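For readers outside R, the dissimilarity structure that `dist` produces is just the matrix of pairwise Euclidean distances in that m-dimensional space. A minimal NumPy sketch of the same computation (the function name is my own):

```python
import numpy as np

def dist_matrix(X):
    """Analogue of R's dist(): Euclidean distance from every sample
    (row of X) to every other sample in m-dimensional space."""
    diff = X[:, None, :] - X[None, :, :]       # pairwise coordinate differences
    return np.sqrt((diff ** 2).sum(axis=2))    # n x n symmetric distance matrix
```

(R's `dist` stores only the lower triangle, since the matrix is symmetric with a zero diagonal.)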

Initially, each sample is assigned to its own cluster, and then the hclust algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster.

For more details about the hclust function, check

`?hclust`

Here in this example, we will cluster similar countries on the basis of their similarity, so that decision-making becomes easier. In order to find the similarities between observations and group the data, we need to perform cluster analysis.

| country | area | gdp | inflation | life expect | military | pop growth | unemployment |
|---|---|---|---|---|---|---|---|
| Austria | 83871 | 41600 | 3.5 | 79 | 0.8 | 0.5 | 4 |
| Belgium | 95326 | 37589 | 3.5 | 78 | 1.3 | 0.4 | 2 |
| Bulgaria | 56356 | 13456 | 2.6 | 78 | 2.3 | 0.3 | 3 |
| Croatia | 73569 | 18000 | 4.5 | 79 | 1.5 | 0.2 | 5 |
| Czech Republic | 43568 | 27156 | 4 | 78 | 1.6 | -2 | 1 |
| Denmark | 338155 | 37256 | 2 | 56 | 4 | 2 | 1.5 |
| Estonia | 152632 | 20156 | 3 | 78 | 2 | 1.9 | 4 |
| Germany | 132562 | 36252 | 4.9 | 74 | 2 | 1.8 | 3 |
| Hungary | 93265 | 38265 | 5.9 | 69 | 3.1 | 1.5 | 3.5 |
| Iceland | 100000 | 25655 | 1.5 | 65 | 4 | 1.2 | 3.6 |
| Italy | 70125 | 19654 | 2.8 | 86 | 2 | -0.8 | 2.5 |
| Latvia | 302325 | 38569 | 3.6 | 72 | 1.2 | 1.9 | 4 |
| Lithuania | 64523 | 40256 | 5.6 | 88 | 1.3 | -1.5 | 4.01 |
| Luxembourg | 65235 | 32565 | 4.5 | 98 | 1.5 | 1.6 | 1.8 |
| Netherlands | 41256 | 12568 | 2.6 | 67 | 1.4 | 0.6 | 2.5 |
| Norway | 326598 | 19568 | 7.2 | 73 | 1.69 | 0.3 | 1.23 |
| Portugal | 312654 | 18652 | 1.53 | 74 | 2.6 | -1.2 | 1.6 |
| Slovakia | 92356 | 45895 | 0.26 | 72 | 3.1 | 0.6 | 5 |
| Slovenia | 49265 | 123654 | 2.25 | 75 | 1.5 | 0.5 | 6 |
| Spain | 20125 | 26651 | 23.5 | 76.5 | 2 | 0.5 | 4.2 |
| Sweden | 502354 | 21561 | 26.2 | 86.3 | 1.9 | -0.2 | 2.356 |
| Switzerland | 495632 | 125465 | 56 | 56.9 | 1.8 | 0.003 | 1.8 |

In this example, we will use hierarchical cluster analysis to group the countries. Cluster analysis also allows us to summarise the data by grouping all similar observations into different clusters. Observations are grouped by considering similar values across the variables, i.e. if the Euclidean distance between two observations is small, they are grouped together. We can perform cluster analysis with the **dist and hclust** functions.

**dist function:** calculates a distance matrix of the provided values, using Euclidean distance by default. From the calculated Euclidean distances a hierarchical clustering can be derived; **to perform this we use the hclust function**.

The **hclust function** has a `method` argument that specifies how the clustering is to be done. The methods include **ward**, **single**, **average**, **median**, **complete** and **centroid**, with complete linkage being the default.

**Steps to make hierarchical clustering in R**

**Step 1:** First, load the dataset into the R workspace and save it in a variable named `survey`:

`survey<-read.csv("survey.csv", header=TRUE)`


**Step 2:** To perform hierarchical clustering, call `hclust` on the distance matrix of the dataset:

`surveyclust<-hclust(dist(survey[-1]))`

The result of `hclust` is saved in the variable `surveyclust`. **The `[-1]` removes the first column, i.e. the country name, since it has no numerical relationship with the rest of the data.**

**Step 3:** Plot the dendrogram:

`plot(surveyclust)`

We pass the clustering result variable to `plot` to draw the dendrogram.

**The numbers you see on the dendrogram plot are the row numbers of the countries in the table;** countries are plotted based on their similarities.

**Step 4:** We can also mark clusters on the dendrogram using

`rect.hclust(surveyclust, k = 5)`

The arguments are the clustering model and the number of clusters. The dendrogram will now show the 5 clusters outlined.

`rect.hclust(surveyclust, k = 4)`

The dendrogram will now show 4 clusters outlined.


The post Types of Cluster Analysis and Techniques using R appeared first on StepUp Analytics.

- Definition
- Types
- Techniques used to form clusters

**Definition:**

- It groups similar data into the same group.
- The goal of this procedure is that the objects in a group are similar to one another and different from the objects in other groups.
- The greater the similarity within a group and the greater the difference between groups, the more distinct the clustering.
- Cluster analysis reveals potential relationships and constructs a systematic structure in a large number of variables and observations.

Main objectives of clustering are:

- Intra-cluster distance is minimized.
- Inter-cluster distance is maximized.

- **Hierarchical clustering:** Also known as ‘nested clustering’, as it allows clusters to exist within bigger clusters, forming a tree.
- **Partition clustering:** Simply a division of the set of data objects into non-overlapping clusters such that each object is in exactly one subset.
- **Exclusive clustering:** Assigns each value to a single cluster.
- **Overlapping clustering:** Used to reflect the fact that an object can simultaneously belong to more than one group.
- **Fuzzy clustering:** Every object belongs to every cluster with a membership weight between 0 (it absolutely doesn’t belong to the cluster) and 1 (it absolutely belongs to the cluster).
- **Complete clustering:** Performs a hierarchical clustering using a set of dissimilarities on the ‘n’ objects being clustered; it tends to find compact clusters of approximately equal diameter.

**Techniques to form cluster method:**

- K-means
- Agglomerative hierarchical clustering
- DBSCAN.

Here in this article we will learn K-means clustering using R.

**K-means:**

K-means clustering is an **unsupervised learning algorithm** that tries to cluster data based on their similarity. **Unsupervised learning** means that there is no outcome to be predicted; the algorithm just tries to find patterns in the data. In k-means clustering, we have to specify the number of clusters we want the data to be grouped into. The algorithm randomly assigns each observation to a cluster and finds the centroid of each cluster. Then the algorithm iterates through two steps:

- Reassign data points to the cluster whose centroid is closest.
- Calculate new centroid of each cluster.

These two steps are repeated until the within-cluster variation cannot be reduced any further. The within-cluster variation is calculated as the sum of the squared Euclidean distances between the data points and their respective cluster centroids.
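The quantity being minimised — which R's `kmeans` reports as `tot.withinss` — can be written down directly. The article's code is in R; as a hedged aside, a NumPy sketch (the function name is my own):

```python
import numpy as np

def within_cluster_variation(X, labels, centroids):
    """Total within-cluster variation: the sum of squared Euclidean
    distances from each point to its own cluster centroid (the
    quantity k-means tries to minimise; R's tot.withinss)."""
    return sum(np.sum((X[labels == j] - c) ** 2)
               for j, c in enumerate(centroids))
```

Each relocation step of the algorithm can only decrease (or leave unchanged) this value, which is why the iteration terminates.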

The `iris` dataset contains data about the sepal length, sepal width, petal length, and petal width of flowers of different species. Let us see what it looks like:

`library(datasets)`

`head(iris)`

```
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
```

After a little bit of exploration, I found that `Petal.Length` and `Petal.Width` were similar among the same species but varied considerably between different species, as demonstrated below:

`library(ggplot2) # this command loads the graphical package`

`ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()`

Here:

- `iris`: the name of the dataset
- `Petal.Length, Petal.Width`: properties of the species
- `color = Species`: different species will be shown in different colors
- `geom_point()`: the output will be shown as dots

Here is the graph:

Here in this plot you can see that the petal length and width are almost the same within each species.

Okay, now that we have seen the data, let us try to cluster it. Since the initial cluster assignments are random, let us set the seed to ensure reproducibility.

`set.seed(20)`

`irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)`

`irisCluster`

```
K-means clustering with 3 clusters of sizes 50, 52, 48

Cluster means:
  Petal.Length Petal.Width
1     1.462000    0.246000
2     4.269231    1.342308
3     5.595833    2.037500

Clustering vector:
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [75] 2 2 2 3 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 2 3 3 3 3
[112] 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3
[149] 3 3

Within cluster sum of squares by cluster:
[1]  2.02200 13.05769 16.29167
 (between_SS / total_SS =  94.3 %)

Available components:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"
```

Since we know that there are 3 species involved, we ask the algorithm to group the data into 3 clusters, and since the starting assignments are random, we specify `nstart = 20`. This means that R will try 20 different random starting assignments and then select the one with the lowest within-cluster variation.
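What `nstart = 20` does can be sketched explicitly: run the whole algorithm from several random starts and keep the run with the lowest within-cluster variation. The article's code uses R; purely as an illustration, a NumPy sketch where all names are my own:

```python
import numpy as np

def kmeans_restarts(X, k, nstart=20, max_iter=100, seed=0):
    """Like R's kmeans(..., nstart=n): run k-means from several random
    starts and keep the run with the lowest within-cluster variation."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(nstart):
        # Random start: k distinct data points as initial centroids.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
            labels = d.argmin(axis=1)
            new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new, centroids):
                break
            centroids = new
        # tot.withinss: sum of squared distances to own centroid.
        inertia = ((X - centroids[labels]) ** 2).sum()
        if best is None or inertia < best[0]:
            best = (inertia, labels, centroids)
    return best
```

Keeping the lowest-inertia run guards against a single unlucky random start converging to a poor local optimum.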

We can see the cluster centroids, the clusters that each data point was assigned to, and the within cluster variation.

Let us compare the clusters with the species.

`table(irisCluster$cluster, iris$Species)`

```
    setosa versicolor virginica
  1     50          0         0
  2      0         48         4
  3      0          2        46
```

As we can see, the data belonging to the `setosa` species got grouped into cluster 1, `versicolor` into cluster 2, and `virginica` into cluster 3. The algorithm wrongly classified two data points belonging to `versicolor` and four data points belonging to `virginica`.

We can also plot the data to see the clusters:

`irisCluster$cluster <- as.factor(irisCluster$cluster)`

`ggplot(iris, aes(Petal.Length, Petal.Width, color = irisCluster$cluster)) + geom_point()`

Here is the plot:

That brings us to the end of the article. I hope you enjoyed it! If you have any questions or feedback, feel free to leave a comment.


