K-Means Clustering Algorithm

“What gets measured, gets managed.” – Peter Drucker

Table of Contents

  • Introduction
  • Types of clustering
  • When to use it?
  • Pseudocode
  • How does it work?
  • Distance measures
  • Important hyper-parameters
  • Choosing the optimal k value
  • Picking the initial k points or the centroids
  • Time complexity
  • Implementing k-means in Python
  • Advantages and Disadvantages
  • Applications

Introduction to K-Means Clustering

The main aim of every clustering technique is to group similar data points together, and the k-means clustering algorithm serves exactly this purpose. K-means is an unsupervised machine learning algorithm and a method of vector quantization that groups similar data points by minimizing a squared error function.
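Concretely, the squared error (within-cluster sum of squares) that k-means minimizes can be written, with cj denoting the centroid of cluster Cj, as:

J = Σ (over clusters j = 1..k) Σ (over points xi in Cj) ||xi − cj||²

i.e., the sum, over all clusters, of the squared Euclidean distances between each point and the centroid of the cluster it is assigned to.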

Clustering Algorithms Differ On:

  1. Convergence criteria
  2. Cluster assignment of data points
  3. Similarity measure (i.e., which data points count as similar)

You can apply k-means to any clustering problem provided you can derive a proper feature vector (vector-space model) from each data point and you have a similarity/distance measure defined between those feature vectors.

Types of clustering

When to use it?

The k-means clustering algorithm is used when you have unlabeled data (i.e., data without defined categories or groups). It gives reliable results when the clusters are distinct or well separated in space, because the algorithm does not work well for overlapping or non-linearly separable data. It is also appropriate when the number of clusters, i.e., the k value, is known in advance for well-defined data.

Pseudo-code

Input: k, i.e., the number of clusters, and the set of points (x1, x2, …, xn).

  1. Select the ‘k’ value, i.e., the number of clusters you want to identify.
  2. Choose ‘k’ data points at random from the vector space as the initial centroids (c1, c2, …, ck).
  3. Repeat until convergence:
    1. for each data point xi:
      1. find the nearest centroid cj using the Euclidean distance.
      2. assign xi to that nearest cluster j.
    2. for each cluster j, compute the new centroid as the mean of all the points xi assigned to it.
  4. Terminate when none of the cluster assignments change.

You can also stop once an iteration budget is exhausted, which yields approximate results.
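A minimal Python sketch of this pseudo-code, assuming NumPy and a feature matrix X of shape (n_points, n_features); the function and variable names below are illustrative, not from any particular library:

import numpy as np

def kmeans(X, k, max_iter=300, seed=0):
    # Plain Lloyd's algorithm; X is an (n_points, n_features) NumPy array.
    rng = np.random.default_rng(seed)
    # Step 2: choose k data points at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):  # iteration budget, in case convergence is slow
        # Step 3a: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # Step 4: no cluster assignment changed, so terminate
        labels = new_labels
        # Step 3b: recompute each centroid as the mean of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels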

How does it work?

Now let’s take an example to understand this better. Assume a vector space with some data points and let k = 2. This means you have to group these data points into 2 clusters.

Choose any 2 data points at random as the centroids, say C1 and C2. Let the yellow data points form one cluster and the red data points the other.

Calculate the Euclidean distance of each data point from centroids C1 and C2 and assign each point to the nearest cluster (i.e., red or yellow).

Take the mean of all the points in each cluster and make that mean vector the new centroid.

When no cluster assignment changes any more, the algorithm stops: these are the final 2 clusters.

Note: do not ignore the outliers; define a strict rule for handling them. The presence of an outlier can lead to the selection of a very different initial centroid.
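The same two-cluster walkthrough can be reproduced in a few lines with scikit-learn (a minimal sketch; the six toy points are made up purely for illustration):

import numpy as np
from sklearn.cluster import KMeans

# Six points forming two obvious groups, as in the walkthrough above (k = 2).
X = np.array([[1, 1], [1, 2], [2, 1],    # one compact group
              [8, 8], [8, 9], [9, 8]])   # a second compact group

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)            # cluster assignment of each point
print(model.cluster_centers_)   # final centroids, i.e., the cluster means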

Distance measures

Euclidean distance: it is the ordinary straight-line distance between two points.

Squared Euclidean distance: it is the square of the Euclidean distance (no square root is taken).

Manhattan distance: it is the sum of the absolute horizontal and vertical components, i.e., the distance between two points measured along the axes at right angles.

Cosine distance: it is based on the angle between two vectors; the smaller the angle, the more similar the vectors.
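A minimal sketch of these four measures for two NumPy vectors a and b (the function names are illustrative):

import numpy as np

def euclidean(a, b):
    # Ordinary straight-line distance between two points.
    return np.sqrt(np.sum((a - b) ** 2))

def squared_euclidean(a, b):
    # Square of the Euclidean distance (no square root taken).
    return np.sum((a - b) ** 2)

def manhattan(a, b):
    # Sum of the absolute differences along each axis.
    return np.sum(np.abs(a - b))

def cosine_distance(a, b):
    # One minus the cosine of the angle between the two vectors.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))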

Important Hyper-parameters

class sklearn.cluster.KMeans(n_clusters=8, init=’k-means++’, n_init=10, max_iter=300, tol=0.0001, precompute_distances=’auto’, verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm=’auto’)

  • n_clusters: optional, default: 8.
    • The number of clusters to be formed.
  • init: {‘k-means++’, ‘random’ or an ndarray}
    • the default is ‘k-means++’:
    • If init = ‘k-means++’: selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See section Notes in k_init for more details.
    • If init = ‘random’: choose k observations (rows) at random from data for the initial centroids.
    • If init is an ndarray, it should have shape (n_clusters, n_features) and gives the initial centers.
  • n_init: default: 10
    • The number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of the n_init consecutive runs in terms of inertia.
  • max_iter: default: 300
    • The maximum number of iterations of the k-means algorithm in a single run.
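Putting these hyper-parameters together in a minimal usage sketch (the values simply spell out the defaults; random_state is added here only to make the runs reproducible):

from sklearn.cluster import KMeans

model = KMeans(
    n_clusters=8,        # number of clusters to form
    init='k-means++',    # smart seeding of the initial centroids
    n_init=10,           # run 10 times with different seeds, keep the lowest-inertia run
    max_iter=300,        # cap on iterations within a single run
    random_state=42,     # fixed seed for reproducible results
)
# model.fit(X) then runs the algorithm on a feature matrix X.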

Choosing the optimal k value

Sometimes finding the optimal k value can be difficult; for a dataset whose clusters are not well separated you will need to break a sweat to find it. There are two ways of finding the optimal k value.

Try different values of k: start with k = 1 and keep incrementing the k value, quantifying the result by the total variation and selecting the best k value (k = 1 is the worst-case scenario). In other words, you are comparing the mean distance between the data points and their cluster centroid. Increasing the k value always decreases this distance, and it keeps shrinking until k equals the number of data points (where it reaches zero), so the raw distance alone cannot pick k.

The mean distance is plotted against the ‘k’ value, and the “elbow point,” where the rate of decrease sharply shifts, can be used to roughly determine k. This method is popularly known as the “elbow method”.
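A minimal sketch of the elbow method with scikit-learn and matplotlib (inertia_ is scikit-learn's name for the total within-cluster sum of squared distances; the blob data is generated purely for illustration):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with 4 well-separated blobs, just to make the elbow visible.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker='o')
plt.xlabel('k')
plt.ylabel('total within-cluster sum of squares')
plt.show()   # the elbow (here around k = 4) suggests the k value to pick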

Picking the initial k points or the centroids

Choosing the right initial centroids is very important because different initial centroids can produce different clusters. So the question is how to choose them. There are many methods for choosing these points; they fall into two classes: actual-sample starting points and synthetic starting points.

Synthetic Initial Starting Points

A synthetic initial starting point is a point (i.e., a synthetic feature vector) that does not correspond to any specific observation in the actual workload.

The most commonly used synthetic methods are scrambled midpoints, unscrambled midpoints, and scrambled medians.

1. Scrambled midpoints: the algorithm is as follows:

  • (a) Divide the range of (scaled) feature values into k equal-sized partitions. For example, if a feature ranges from –2.0 to 4.0 and k = 3, then the three partitions are [–2.0, 0], [0, 2.0] and [2.0, 4.0]. This partitioning is done for each feature.
  • (b) For each feature, take the midpoint of each partition as a potential initial starting feature value.
  • (c) For each feature, randomly select one of the partition midpoints to include in the starting point. Do this k times to construct the needed k starting points.

Since different combinations of features yield different error values, the k-means clustering algorithm is run multiple times for each k, and the lowest error value for each k is used when constructing the Ek curve.
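A minimal sketch of the scrambled-midpoints steps above (an illustrative helper, not a library function):

import numpy as np

def scrambled_midpoints(X, k, seed=0):
    # Build k synthetic starting points from the partition midpoints of each feature.
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    width = (hi - lo) / k
    # Steps (a) and (b): the midpoint of partition p of a feature is lo + (p + 0.5) * width.
    midpoints = np.array([lo + (p + 0.5) * width for p in range(k)])  # shape (k, n_features)
    # Step (c): for each starting point, pick one partition midpoint per feature at random.
    starts = np.empty((k, X.shape[1]))
    for i in range(k):
        for f in range(X.shape[1]):
            starts[i, f] = midpoints[rng.integers(k), f]
    return starts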

2. Unscrambled midpoints: it is the same as the scrambled-midpoints method, except that in step (c) the features are combined based on the partition they are in, so there is no need to run the algorithm multiple times when using this method.

3. Scrambled medians: it is the same as the scrambled-midpoints algorithm, except that in step (b) the median value of each partition (rather than the midpoint) is used as a potential initial starting feature value. The algorithm is run multiple times for each k and the lowest error value for each k is used for building the Ek curve.

In the comparison of these three methods, the scrambled-midpoints method outperforms the other two because it consistently gives lower error values; the scrambled methods are also superior to the unscrambled method.

4. Selecting a ‘dispersed’ set of points:

  1. From the n objects, compute a point whose attribute values are the averages of the n objects’ attribute values; this average point is the first initial centroid.
  2. Select the next initial centroid from the n objects such that its Euclidean distance from the previously selected initial centroids is maximal.
  3. Repeat step 2 until we have k initial centroids.
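A minimal sketch of this dispersed selection (one common reading of step 2: each new centroid is the data point farthest from its nearest already-chosen centroid):

import numpy as np

def dispersed_centroids(X, k):
    # Step 1: the first initial centroid is the average of all n objects.
    centroids = [X.mean(axis=0)]
    # Steps 2-3: repeatedly pick the object farthest from the centroids chosen so far.
    while len(centroids) < k:
        dists = np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2)
        farthest = dists.min(axis=1).argmax()   # most distant from its nearest chosen centroid
        centroids.append(X[farthest])
    return np.array(centroids)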

Time Complexity

In order to analyze the time complexity of this algorithm, you need to identify the operations performed at each step and in each iteration of the algorithm. The operations in each iteration are:

  • To calculate the distance from a point to a centroid, we can use the Euclidean proximity function. It involves two subtractions, one summation, two multiplications and one square-root operation, i.e., 6 operations.
  • Comparisons between distances
  • Calculation of centroids

So, the total number of operations for an iteration
= distance calculation operations + comparison operations + centroids calculation operations
= 6*[k*m*n] operations + [(k-1)*m*n] operations + [k*((m-1) + 1)*n] operations

Assume that the algorithm converges after I iterations, then total number of operations for the algorithm
= 6*[I*k*m*n] operations + [I*(k-1)*m*n] operations + [I*k*((m-1) + 1)*n] operations
= [6Ikmn + I(k-1)mn + Ikmn] operations
≈ O(I*k*m*n)
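For a rough sense of scale (illustrative numbers): with 1000 points (n), 2 attributes (m), 3 clusters (k) and 10 iterations (I), this comes to 6·10·3·2·1000 + 10·2·2·1000 + 10·3·2·1000 = 360,000 + 40,000 + 60,000 = 460,000 operations, i.e., the cost grows linearly in each of I, k, m and n.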

Implementing k-means in Python

Dataset: Clara.csv
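A minimal sketch of the implementation, assuming Clara.csv sits in the working directory and contains only numeric feature columns (the choice of k = 3 here is illustrative; in practice pick it with the elbow method above):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load the dataset and scale it -- k-means is sensitive to feature scaling.
data = pd.read_csv('Clara.csv')
X = StandardScaler().fit_transform(data.select_dtypes('number'))

# Fit k-means with k-means++ seeding and 10 restarts.
model = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
labels = model.fit_predict(X)

# Visualise the first two (scaled) features coloured by cluster, with the centroids marked.
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1], marker='x', c='black')
plt.show()
print(data.assign(cluster=labels).head())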

Advantages and Disadvantages

Advantages :

  • Easy to implement.
  • It is fast and robust.
  • Gives better results when the data points are distinct and well separated.
  • Produces more compact clusters than hierarchical clustering.
  • In case of a large number of variables, k-means is computationally faster than hierarchical clustering provided k is a small value.

Disadvantages :

  • Difficult to predict the k value.
  • Different initial partitions can result in different clusters.
  • May terminate at a local optimum rather than the global one.
  • Sensitive to the order in which data points are explored.
  • Sensitive to scaling: rescaling the data might change the result.
  • Sensitive to outliers and noise.
  • Fails in case of heavily overlapping data.
  • Applicable only when the mean is defined, i.e., it fails for categorical data.
  • Unfit for non-linearly separable datasets.

Applications

  • Search engines: clustering is used to group similar results together so that the most relevant results appear on the front page.
  • Wireless sensor network based applications: landmine detection.
  • Academics: this algorithm can be used to group students into clusters based on their performance. Knowing how many students fall into each cluster gives a picture of the average performance of the class as a whole.
  • Drug activity prediction

Note: the next topic you should look at is Random Forest.
