K-means

The k-means clustering method is an unsupervised machine learning technique used to identify clusters of data objects in a dataset. There are many different types of clustering methods, but k-means is one of the oldest and most approachable. Source: https://realpython.com/k-means-clustering-python/

Introduction

K-means is an iterative algorithm that is applied to objects that can be represented as points in an n-dimensional space.

Each point belongs to exactly one cluster. The algorithm tries to make the data points within a cluster as similar as possible while keeping the clusters themselves as different (far apart) as possible.

It assigns data points to clusters such that the sum of the squared distances between the data points and their cluster's centroid (the arithmetic mean of all the data points that belong to that cluster) is minimized.

The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.

K-means minimizes the total squared distance between each input point x_i and its cluster center c_j:

    J = \sum_{j=1}^{k} \sum_{x_i \in S_j} \| x_i - c_j \|^2

where S_j is the set of points assigned to cluster j.

It is an iterative algorithm that alternates between assigning points to clusters and recomputing the centers of those clusters until it converges to a (locally) optimal solution.

Source: https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a

Explanation

K-means depends on a notion of distance between points. Common choices are:

1. Euclidean distance
The most common choice: the square root of the sum of the squared coordinate differences.

2. Manhattan distance
An approximation to the Euclidean distance that is cheaper to compute: the sum of the absolute differences of the Cartesian coordinates.

3. Minkowski distance
A generalization of both the Euclidean and Manhattan distances; a sketch of all three follows.
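
To make the three metrics concrete, here is a minimal Python sketch; the function names and example points are illustrative choices, not from the source:

    def euclidean(p, q):
        # Square root of the sum of squared coordinate differences.
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def manhattan(p, q):
        # Sum of the absolute differences of the Cartesian coordinates.
        return sum(abs(a - b) for a, b in zip(p, q))

    def minkowski(p, q, r):
        # Generalization: r = 2 gives Euclidean, r = 1 gives Manhattan.
        return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

    print(euclidean((0, 0), (3, 4)))     # 5.0
    print(manhattan((0, 0), (3, 4)))     # 7
    print(minkowski((0, 0), (3, 4), 2))  # 5.0, same as Euclidean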

How to

1. Initialization

Arbitrary initialization of the centers (for example, k randomly chosen input points).

2. Data Assignment

Each point is assigned to its closest cluster center. Ties are broken by randomly assigning the point to one of the closest clusters. This yields a partitioning of the data.

3. Relocation of the centers

Each cluster representative is moved to the center (arithmetic mean) of the points assigned to it.

4. Repeat steps 2 and 3 until the centers no longer change.
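
A minimal NumPy sketch of these four steps; the function shape and the choice of random input points for initialization are assumptions, not from the source:

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # 1. Initialization: pick k distinct input points as the starting centers.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # 2. Data assignment: each point goes to its closest center.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 3. Relocation: move each center to the mean of its assigned
            #    points (an empty cluster keeps its old center).
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            # 4. Stop once the centers no longer change.
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers

    # Example: two well-separated blobs.
    X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
    labels, centers = kmeans(X, k=2)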

Limitations

1. The algorithm is sensitive to outliers: the cluster centers are computed with the mean function, which is sensitive to outliers.

2. There is a risk of empty clusters, especially for high-dimensional data: it can happen that no points are closest to a chosen center.

3. It is quite sensitive to the initial set-up (the locations of the cluster centers), which might lead to finding a local minimum instead of the global minimum.

To mitigate this, it is better to run the algorithm multiple times (with different initializations) or to use a robust way of initializing it, such as k-means++; a sketch follows.
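
As an illustration, scikit-learn's KMeans exposes both mitigations directly; the toy data here is made up:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0],
                  [5.1, 4.9], [9.0, 0.0], [9.2, 0.3]])

    # n_init restarts k-means with different initializations and keeps the
    # run with the lowest WSS (inertia_); init="k-means++" spreads the
    # initial centers apart, a robust initialization scheme.
    km = KMeans(n_clusters=3, init="k-means++", n_init=10).fit(X)
    print(km.inertia_)  # WSS of the best of the 10 runs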

4. Choosing the value of k is difficult unless one has prior knowledge about the natural number of groups present in the data.

Methods to select the optimal k in k-means

1. Elbow Method

This is probably the most well-known method for determining the optimal number of clusters.

Calculate the Within-Cluster Sum of Squared Errors (WSS) for different values of k, and choose the k at which the decrease in WSS first starts to level off. In the plot of WSS versus k, this is visible as an elbow.

  • The Squared Error for each point is the square of the distance of the point from its representative, i.e., its predicted cluster center.
  • The WSS score is the sum of these Squared Errors for all the points.
  • Any distance metric like the Euclidean Distance or the Manhattan Distance can be used.
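
A minimal sketch of the Elbow Method using scikit-learn, whose inertia_ attribute is exactly the WSS; the three-blob synthetic data is an assumption for illustration:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
                   for c in [(0, 0), (4, 4), (8, 0)]])  # three blobs

    # WSS for k = 1..8; the elbow is where the curve flattens out (k = 3 here).
    for k in range(1, 9):
        wss = KMeans(n_clusters=k, n_init=10).fit(X).inertia_
        print(k, round(wss, 1))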

2. Silhouette Method

The silhouette value measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation).

The range of the Silhouette value is between -1 and +1. A high value is desirable and indicates that the point is placed in the correct cluster. If many points have a negative Silhouette value, it may indicate that we have created too many or too few clusters.

The Silhouette value s(i) for each data point i is defined as follows:

    s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \quad \text{if } |C_i| > 1

and s(i) is defined to be zero if i is the only point in its cluster. This prevents the number of clusters from increasing significantly through many single-point clusters.

a(i) is the measure of similarity of the point i to its own cluster C_i, measured as the average distance from i to the other points in that cluster:

    a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\, j \neq i} d(i, j)

Similarly, b(i) is the measure of dissimilarity of i from the points in other clusters, measured as the smallest average distance from i to all the points of any other cluster:

    b(i) = \min_{C_k \neq C_i} \frac{1}{|C_k|} \sum_{j \in C_k} d(i, j)

d(i, j) is the distance between points i and j. Generally, the Euclidean distance is used as the distance metric.
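
A minimal sketch of the Silhouette Method with scikit-learn, which computes the mean s(i) over all points (with Euclidean distance by default); the synthetic data is illustrative:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
                   for c in [(0, 0), (4, 4), (8, 0)]])

    # Mean silhouette value for several candidate k; pick the highest score.
    for k in range(2, 7):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        print(k, round(silhouette_score(X, labels), 3))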
