Beginner's Guide to Statistical Cluster Analysis in Detail

In the last part of the Statistical Cluster Analysis series, we discussed hierarchical cluster analysis (HCA), the first of the exploratory data analysis techniques.
In this second part, we cover the second exploratory data analysis technique.
Non-Hierarchical Cluster Analysis (Non-HCA):
Features:

  • Clusters do not form a hierarchy.
  • No distance matrix needs to be calculated.
  • Better suited to large datasets.

Non-HCA methods start from either

  • an initial partition of the items/objects into groups

or

  • an initial set of seed points, which will form the nuclei of the initial clusters.

NOTE: One way to start is to randomly select seed points from among the items, or to randomly partition the data (i.e. items/objects) into initial groups.
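Both random starts mentioned in the note can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `random_seed_points` and `random_partition` are hypothetical, the four two-dimensional points are made-up toy data, and the partition helper deliberately keeps every group non-empty, which the note itself does not require.

```python
import random

def random_seed_points(items, k, rng):
    """Pick k distinct items to act as seed points (initial cluster nuclei)."""
    return rng.sample(items, k)

def random_partition(items, k, rng):
    """Give every item a random group label 0..k-1.

    Labels are dealt out round-robin and then shuffled, so each group
    receives at least one item (assumes len(items) >= k).
    """
    labels = [i % k for i in range(len(items))]
    rng.shuffle(labels)
    return labels

# Made-up toy data: four items measured on two variables.
items = [(5.0, 3.0), (-1.0, 1.0), (1.0, -2.0), (-3.0, -2.0)]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

seeds = random_seed_points(items, 2, rng)   # option 1: random seed points
labels = random_partition(items, 2, rng)    # option 2: random initial groups
```

Because the start is random, different seeds can lead to different final clusters, so it is common to try several starts.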
K-means method or the method of iterative relocation:- 

K-means is an algorithm that assigns each object to the cluster with the nearest centroid (nucleus).

Algorithm:

  1. Partition the items/objects into ‘k’ initial clusters.
  2. Go through the items/objects and reassign each one to the cluster whose centroid is nearest. Recalculate the centroid/nucleus for the cluster receiving the new item and for the cluster losing the item.
  3. Repeat step 2 until no more reassignments take place.
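The three steps above can be sketched in Python as follows. This is a minimal illustration, not the worked example from this post: the function names `centroid` and `k_means_relocation`, the `max_passes` cap, the use of Euclidean distance, the rule that an item is never moved out of a cluster it occupies alone, and the four toy points are all assumptions made for the sketch.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def k_means_relocation(items, labels, k, max_passes=100):
    """Iterative relocation starting from an initial partition `labels` (step 1)."""
    labels = list(labels)
    centroids = [centroid([x for x, lab in zip(items, labels) if lab == j])
                 for j in range(k)]
    for _ in range(max_passes):
        moved = False
        for i, x in enumerate(items):
            old = labels[i]
            # Step 2: find the cluster whose centroid is nearest (Euclidean distance).
            new = min(range(k), key=lambda j: math.dist(x, centroids[j]))
            # Assumption of this sketch: never empty a cluster.
            if new != old and labels.count(old) > 1:
                labels[i] = new
                # Recalculate centroids for the cluster losing and the cluster receiving the item.
                for j in (old, new):
                    centroids[j] = centroid(
                        [y for y, lab in zip(items, labels) if lab == j])
                moved = True
        if not moved:  # Step 3: stop when a full pass makes no reassignment.
            break
    return labels, centroids

# Made-up toy data and an arbitrary initial partition into k = 2 clusters.
items = [(5.0, 3.0), (-1.0, 1.0), (1.0, -2.0), (-3.0, -2.0)]
final_labels, final_centroids = k_means_relocation(items, [0, 0, 1, 1], k=2)
print(final_labels)     # -> [0, 1, 1, 1]
print(final_centroids)  # -> [(5.0, 3.0), (-1.0, -1.0)]
```

On this toy data the second item moves to the other cluster during the first pass, and the second pass makes no further changes, so the procedure stops.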

NOTE:- Rather than starting with a partition of all items into ‘k’ initial groups (as in step 1), we can also specify ‘k’ initial centroids/nuclei (seed points), walk through the data once to assign each item to its nearest seed, and then proceed to step 2.
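A seed-point start can be turned into an initial grouping by assigning every item to its nearest seed. The sketch below shows one simple way to do that, assuming Euclidean distance; the helper name `assign_to_nearest_seed`, the toy points, and the two seeds are hypothetical choices made for illustration.

```python
import math

def assign_to_nearest_seed(items, seeds):
    """Return, for each item, the index of the nearest seed point."""
    return [min(range(len(seeds)), key=lambda j: math.dist(x, seeds[j]))
            for x in items]

# Made-up toy data; the first and last items are used as the two seed points.
items = [(5.0, 3.0), (-1.0, 1.0), (1.0, -2.0), (-3.0, -2.0)]
seeds = [(5.0, 3.0), (-3.0, -2.0)]

initial_labels = assign_to_nearest_seed(items, seeds)
```

The resulting labels play the role of the initial partition, after which the relocation in step 2 proceeds as usual.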
A handwritten solved example is attached below. Go through the steps and try to follow them; if anything is unclear, please send me an email at irrfankhann29@gmail.com.
PDF file of the example: statistical non-hierarchical cluster analysis
CLUSTER CRITERIA:- Comparing different partitions
Objective:- To have a criterion for the optimum partition of the data. Given a set of cases and a given number of clusters, the problem reduces to partitioning the data into ‘g’ clusters so that the clustering criterion is optimized.

  • Let the ‘n’ data points (cases) be: x(1), x(2), ..., x(n).
  • The sample variance-covariance matrix is given by
    • ∑ = (1/n) ∑(i) (x(i)-m)(x(i)-m)^t  : {m = (1/n) ∑(i) x(i) -> sample mean}
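  • Let there be ‘g’ clusters and define the membership indicator
    • z(ji) = 1 if x(i) belongs to cluster j, and 0 otherwise  : {i = 1, ..., n; j = 1, ..., g}

We can then write the size and the mean of cluster j as follows

    • n(j) = ∑(i) z(ji)
    • m(j) = (1/n(j)) ∑(i) z(ji) x(i)

Then the within-cluster sum of squares (SS) and cross-product matrix is

    • S(w) = (1/n) ∑(j) ∑(i) z(ji) (x(i)-m(j))(x(i)-m(j))^t

==> the pooled within-cluster scatter matrix over the ‘g’ clusters
The between-cluster SS and cross-product matrix is

    • S(b) = ∑(j) (n(j)/n) (m(j)-m)(m(j)-m)^t = ∑ - S(w)
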
Popular clustering criteria are based on univariate functions of S(b), S(w) or ∑.
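To make these matrices concrete, here is a small pure-Python sketch that computes them for two candidate partitions of the same data. It assumes the usual definitions with the same 1/n normalisation as the sample covariance: the within-cluster matrix sums the scatter of each cluster about its own mean, and the between-cluster matrix sums the scatter of the cluster means about the overall mean, weighted by cluster size. All function names and the toy data are hypothetical, and the trace of the within-cluster matrix is used only as one illustrative univariate function, not necessarily the criterion covered in the next part.

```python
def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def weighted_scatter(points, center, weight):
    """weight * sum of (x - center)(x - center)^t over points, as a p x p nested list."""
    p = len(center)
    s = [[0.0] * p for _ in range(p)]
    for x in points:
        d = [xi - ci for xi, ci in zip(x, center)]
        for r in range(p):
            for c in range(p):
                s[r][c] += weight * d[r] * d[c]
    return s

def add(a, b):
    """Element-wise sum of two matrices of equal shape."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

def scatter_matrices(xs, labels, g):
    """Return (covariance, S_w, S_b), all normalised by 1/n (assumed convention)."""
    n, p = len(xs), len(xs[0])
    m = mean(xs)
    total = weighted_scatter(xs, m, 1.0 / n)
    s_w = [[0.0] * p for _ in range(p)]
    s_b = [[0.0] * p for _ in range(p)]
    for j in range(g):
        members = [x for x, lab in zip(xs, labels) if lab == j]  # assumed non-empty
        m_j = mean(members)
        s_w = add(s_w, weighted_scatter(members, m_j, 1.0 / n))
        s_b = add(s_b, weighted_scatter([m_j], m, len(members) / n))
    return total, s_w, s_b

def trace(a):
    """Sum of the diagonal entries of a square matrix."""
    return sum(a[i][i] for i in range(len(a)))

# Made-up toy data and two candidate partitions into g = 2 clusters.
xs = [(5.0, 3.0), (-1.0, 1.0), (1.0, -2.0), (-3.0, -2.0)]
partition_a = [0, 0, 1, 1]
partition_b = [0, 1, 1, 1]

total, s_w_a, s_b_a = scatter_matrices(xs, partition_a, 2)
_, s_w_b, s_b_b = scatter_matrices(xs, partition_b, 2)

print(trace(s_w_a), trace(s_w_b))  # smaller means tighter clusters
```

For both partitions the within-cluster and between-cluster matrices add up to the overall covariance matrix, so a partition that shrinks one necessarily grows the other. On this toy data the second partition has the smaller within-cluster trace, so it would be preferred under this particular criterion.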
I will share the criteria in the next part of the cluster analysis series. Until then, stay tuned and practice your cluster analysis skills; if you have any doubts, please send me an email at irrfankhann29@gmail.com.
Article Originally posted Here 
