K-means is a clustering algorithm used in machine learning and data analysis. Its primary goal is to partition a given dataset into K clusters, where K is a predetermined number chosen by the user.
It is an iterative algorithm that assigns each data point to one of the K clusters based on its proximity to the cluster centroids.
The steps are as follows:
1. Initialization: Randomly select K points from the dataset as initial cluster centers.
2. Assignment: For each data point, compute the distance to each cluster center and assign it to the cluster with the closest center. This step creates initial clusters.
3. Update: Recalculate the cluster centers by computing the mean of the data points assigned to each cluster. This step moves the cluster centers to the center of their respective clusters.
4. Repeat steps 2 and 3 until convergence: Iteratively reassign data points to clusters and update the cluster centers. Convergence occurs when the assignments no longer change, or when the movement of the cluster centers between iterations falls below a specified threshold.
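The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names (kmeans, squared_distance), the max_iter and seed parameters, the choice of Euclidean distance, and the rule of keeping the old center when a cluster becomes empty are all assumptions made for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    # Initialization: pick K data points as the initial centers.
    centers = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment: each point goes to the cluster with the closest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Convergence: stop once no assignment has changed.
        if new_labels == labels:
            break
        labels = new_labels
        # Update: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels

# Two well-separated groups of three points each (toy data for illustration).
data = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0),
        (10.0, 10.0), (11.0, 11.0), (12.0, 12.0)]
centers, labels = kmeans(data, k=2)
print(centers, labels)
```

The sketch uses only the convergence test based on unchanged assignments; a threshold on how far the centers move would be an equally reasonable stopping rule.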
The final outcome of the K-means algorithm is a set of K clusters, where each data point is assigned to a specific cluster. The algorithm optimizes the cluster centers to minimize the sum of squared distances between data points and their assigned cluster centers. K-means aims to create compact, well-separated clusters based on the similarity of data points.
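The quantity being minimized can be written out explicitly. The notation here is an assumption introduced for illustration: C_k denotes the set of points assigned to cluster k, and mu_k denotes the center (mean) of that cluster.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

The assignment step lowers J for fixed centers, and the update step lowers J for fixed assignments, so J never increases from one iteration to the next.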
K-means limitations
The algorithm is sensitive to outliers: Cluster centers are computed using the "mean" function, which is easily pulled toward extreme values.
It is sensitive to the
Each iteration requires N*k distance comparisons, where N is the number of data points and k is the number of clusters.
It is quite sensitive to the initial set-up (location of cluster centers), which
might lead to finding a local minimum instead of the global minimum.
It is better to run the algorithm multiple times with different initializations,
or to look for a more robust way of initializing it.
It can take a long time on large datasets, since every iteration compares each data point with every cluster center.
Choosing the value of k is difficult. (Silhouette method, elbow method, EDA?)
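Two of the points above, running several initializations and comparing values of k, can be tried out with a short plain-Python sketch. Everything named here (kmeans, sse, best_of_restarts, n_restarts, the toy data) is hypothetical and chosen for illustration. The restart loop keeps the run with the lowest sum of squared distances, and the final loop prints that value for a range of k, which is the raw material for an elbow plot.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, rng, max_iter=100):
    """One K-means run from a random initialization; returns (centers, labels)."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # assignments unchanged: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels

def sse(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))

def best_of_restarts(points, k, n_restarts=10, seed=0):
    """Run K-means n_restarts times; return (sse, centers, labels) of the best run."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        centers, labels = kmeans(points, k, rng)
        score = sse(points, centers, labels)
        if best is None or score < best[0]:
            best = (score, centers, labels)
    return best

# Toy data with two obvious groups; the SSE should drop sharply from k=1 to k=2.
data = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0),
        (10.0, 10.0), (11.0, 11.0), (12.0, 12.0)]
for k in range(1, 5):
    score, _, _ = best_of_restarts(data, k)
    print(k, round(score, 2))
```

On data like this, the value of k after which the printed error stops dropping sharply is the "elbow". On real data the bend is often less clear, which is part of why choosing k is difficult.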
A better way to initialize