The k-means clustering method is an unsupervised machine learning technique used to identify clusters of data objects in a dataset. There are many different types of clustering methods, but k-means is one of the oldest and most approachable. Source: https://realpython.com/k-means-clustering-python/
K-means is an iterative algorithm that is applied to objects that can be represented as points in an n-dimensional space.
Each point belongs to exactly one cluster. The algorithm tries to make the intra-cluster data points as similar as possible while keeping the clusters as different (as far apart) as possible.
It assigns data points to clusters such that the sum of the squared distances between the data points and their cluster's centroid (the arithmetic mean of all the data points that belong to that cluster) is minimized.
The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
K-means minimizes the total squared distance between each input point x_i and its cluster center c_j:
J = Σ_j Σ_{x_i ∈ C_j} ||x_i − c_j||²
It is an iterative algorithm that first assigns points to clusters and then recomputes the centers of those clusters, repeating until it converges to a (locally) optimal solution.
Source: https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a
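To make this objective concrete, here is a minimal NumPy sketch (the toy points and cluster assignments are made up for illustration) that computes the total within-cluster squared distance for a fixed assignment:

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])  # toy points
labels = np.array([0, 0, 1])                           # cluster assignments
centers = np.array([X[labels == j].mean(axis=0) for j in range(2)])

# total squared distance of each point to its assigned cluster center
wss = np.sum((X - centers[labels]) ** 2)
print(wss)  # 0.5 for this toy example
```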
Distance Metrics
1. Euclidean distance
The most common distance metric: the ordinary straight-line distance between two points.
2. Manhattan distance
An approximation to the Euclidean distance that is cheaper to compute: the sum of the absolute differences of the Cartesian coordinates.
3. Minkowski distance
A generalization of both the Euclidean & Manhattan distance
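To make the relationship between the three metrics concrete, here is a minimal sketch (assuming NumPy; the points x and y are made up for illustration) showing that Manhattan and Euclidean distance are the p = 1 and p = 2 special cases of the Minkowski distance:

```python
import numpy as np

def minkowski(x, y, p):
    # Minkowski distance: (sum_i |x_i - y_i|^p)^(1/p)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(minkowski(x, y, p=1))  # Manhattan distance: |1-4| + |2-0| + |3-3| = 5.0
print(minkowski(x, y, p=2))  # Euclidean distance: sqrt(9 + 4 + 0) ≈ 3.606
```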
Algorithm Steps
1. Initialization
Arbitrary initialization of the centers.
2. Data Assignment
Each point is assigned to its closest cluster (center). Ties are broken by randomly assigning the point to one of the tied clusters. This yields a partitioning of the data.
3. Relocation of the centers
Each cluster representative is moved to the center (arithmetic mean) of the points assigned to it.
4. Repeat steps 2 and 3 until the centers no longer change.
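The following is a minimal NumPy sketch of these four steps (the function name kmeans, the choice of initial centers, and the empty-cluster fallback are illustrative assumptions, not taken from the sources above):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct input points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 2. Data assignment: each point goes to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Relocation: move each center to the mean of its assigned points
        #    (an empty cluster keeps its old center; see Limitations below)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # 4. Stop when the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```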
Limitations
1. The algorithm is sensitive to outliers: the cluster centers are computed using the mean, which is itself sensitive to outliers.
2. There is a risk of empty clusters, especially for high-dimensional data: it can happen that no points are closest to a chosen center.
3. It is quite sensitive to the initial set-up (location of the cluster centers), which can lead to finding a local minimum instead of the global minimum. To mitigate this, it is better to run the algorithm multiple times with different initializations or to use a more robust initialization scheme such as k-means++ (see the sketch after this list).
4. Choosing the value of k is difficult unless one has prior knowledge of the natural number of groups present in the data.
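One way to apply the multiple-initializations advice from limitation 3 is scikit-learn's KMeans, which supports this via its n_init and init parameters (a minimal sketch; the make_blobs toy data is purely illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # toy data

# n_init repeats the whole algorithm with different random initializations
# and keeps the run with the lowest within-cluster sum of squares;
# init="k-means++" is a robust scheme that spreads the initial centers apart.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)
```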
Methods to select Optimal K in K-means
1. Elbow Method
This is probably the most well-known method for determining the optimal number of clusters.
Calculate the Within-Cluster Sum of Squared Errors (WSS) for different values of k, and choose the k at which the decrease in WSS first starts to level off. In the plot of WSS versus k, this is visible as an elbow.
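A minimal sketch of the Elbow Method (assuming scikit-learn and matplotlib; inertia_ is scikit-learn's name for the WSS, and the make_blobs toy data is purely illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # toy data

# WSS (inertia_) for k = 1..10
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 11)]

plt.plot(range(1, 11), wss, marker="o")
plt.xlabel("k")
plt.ylabel("WSS (inertia)")
plt.show()  # the bend ("elbow") in the curve suggests the optimal k
```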
2. Silhouette Method
The silhouette value measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation).
The range of the Silhouette value is between +1 and -1. A high value is desirable and indicates that the point is placed in the correct cluster. If many points have a negative Silhouette value, it may indicate that we have created too many or too few clusters.
The Silhouette value s(i) for each data point i is defined as follows:
s(i) = (b(i) − a(i)) / max(a(i), b(i))
s(i) is defined to be equal to zero if i is the only point in the cluster. This is to prevent the number of clusters from increasing significantly with many single-point clusters.
a(i) is the measure of similarity of the point i to its own cluster, computed as the average distance from i to the other points in the same cluster.
Similarly, b(i) is the measure of dissimilarity of i from points in other clusters, computed as the smallest average distance from i to all points in any single other cluster.
d(i, j) is the distance between points i and j. Generally, Euclidean Distance is used as the distance metric.
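A minimal sketch of the Silhouette Method (assuming scikit-learn; silhouette_score averages s(i) over all points and uses Euclidean distance by default, and the make_blobs toy data is purely illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # toy data

for k in range(2, 8):  # the silhouette is undefined for k = 1
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # higher average s(i) is better
```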