
Kmeans

K-means is an iterative algorithm that assigns each data point to one of K clusters based on its proximity to the cluster centroids.

Explanation

K-means is a clustering algorithm used in machine learning and data analysis. Its primary goal is to partition a given dataset into K clusters, where K is a predetermined number chosen by the user.

The steps are as follows:

1. Initialization: Randomly select K points from the dataset as the initial cluster centers.
2. Assignment: For each data point, compute the distance to every cluster center and assign the point to the cluster with the closest center. The first pass creates the initial clusters.
3. Update: Recalculate each cluster center as the mean of the data points assigned to that cluster. This moves every center to the centroid of its cluster.
4. Repeat steps 2 and 3 until convergence: Iteratively reassign data points to clusters and update the cluster centers. Convergence occurs when the assignments no longer change, or when the change between iterations (for example, the movement of the centers) falls below a specified threshold.
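
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names, the `seed` and `max_iter` parameters, and the choice to leave an empty cluster's center where it was are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    # Initialization: pick K data points as the starting centers.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment: each point goes to the cluster with the closest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Convergence: stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Update: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(data, k=2)
```

On this toy dataset the first three points and the last three points end up in separate clusters. Which cluster receives index 0 depends on the random start.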

The final outcome of the K-means algorithm is a set of K clusters, where each data point is assigned to a specific cluster. The algorithm optimizes the cluster centers to minimize the sum of squared distances between data points and their assigned cluster centers. K-means aims to create compact, well-separated clusters based on the similarity of data points.
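
The quantity being minimized can be written out explicitly. The notation here is introduced only for this illustration: x_i is a data point, C_j is the set of points assigned to cluster j, and mu_j is the center of that cluster.

```latex
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

The assignment step lowers J by choosing the closest center for every point, and the update step lowers it by choosing the mean as the center of every cluster.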

K-means limitations

The algorithm is sensitive to outliers: cluster centers are computed with the mean, which a single extreme value can shift substantially.
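
A small one-dimensional example shows how far one outlier can drag a center that is computed as a mean. The numbers are made up for illustration, and the median is shown only as a contrast. It is not part of K-means.

```python
from statistics import mean, median

values = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = values + [100.0]  # one extreme point joins the cluster

center_before = mean(values)        # 3.0
center_after = mean(with_outlier)   # 115 / 6, about 19.17

median_before = median(values)        # 3.0
median_after = median(with_outlier)   # 3.5
```

The mean moves from 3.0 to roughly 19.17, far from every ordinary point in the cluster, while the median barely changes.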

It is sensitive to the

Each iteration requires N*k comparisons (one distance computation per pair of data point and cluster center), where:

N is the number of data points in your dataset.
k is the number of clusters you're trying to identify.
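
To make the N*k figure concrete, the assignment step can be instrumented to count its distance computations. The function name `assign_with_count` is hypothetical and exists only for this sketch.

```python
def assign_with_count(points, centers):
    """Assign each point to its nearest center and count distance evaluations."""
    comparisons = 0
    labels = []
    for p in points:
        best_j, best_d = None, float("inf")
        for j, c in enumerate(centers):
            d = sum((a - b) ** 2 for a, b in zip(p, c))
            comparisons += 1  # one comparison per (point, center) pair
            if d < best_d:
                best_j, best_d = j, d
        labels.append(best_j)
    return labels, comparisons


points = [(0, 0), (1, 0), (9, 9), (10, 9)]
centers = [(0, 0), (10, 10)]
labels, comparisons = assign_with_count(points, centers)
```

With N = 4 points and k = 2 centers, the counter ends at 8.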

It is quite sensitive to the initial set-up (the location of the cluster centers), which can lead to a local minimum instead of the global minimum.
It is better to run the algorithm multiple times with different initializations, or to use a more robust way of initializing it.
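
Running several times and keeping the best result might look like the sketch below. It assumes that "best" means the lowest sum of squared distances to the assigned centers. The names `kmeans_once`, `best_of_restarts`, and `n_init` are illustrative, not taken from any particular library.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_once(points, k, rng, max_iter=100):
    """One K-means run from a random start; returns (centers, labels, inertia)."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Inertia: sum of squared distances from each point to its assigned center.
    inertia = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, inertia


def best_of_restarts(points, k, n_init=10, seed=0):
    """Run K-means n_init times from different starts and keep the lowest inertia."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[2])


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels, inertia = best_of_restarts(data, k=2)
```

Restarts do not guarantee the global minimum. They only make a poor local minimum less likely to be the final answer.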

It can be slow on large datasets, because the N*k comparisons are repeated in every iteration.

Choosing the value of k is difficult. (Possible aids: silhouette method, elbow method, exploratory data analysis?)
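
As one example of these aids, the silhouette score rates a clustering between -1 and 1, with higher values meaning points sit closer to their own cluster than to the nearest other cluster. The sketch below assumes Euclidean distance, at least two clusters, and the common convention that a point alone in its cluster scores 0. The function name is hypothetical.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least two clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(points, [0, 0, 1, 1])  # natural grouping
bad = silhouette_score(points, [0, 1, 0, 1])   # mixed-up grouping
```

To pick k, one would cluster the data for several candidate values and compare the resulting scores. Here the natural grouping scores close to 1 and the mixed-up grouping scores below 0.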

A better way to initialize

The centers are chosen one by one in a controlled fashion.
The k-means++ algorithm selects only the first center uniformly at random from the data.
Each subsequent center is selected with a probability proportional to its contribution to the overall error given the previous selections, that is, proportional to its squared distance from the nearest center already chosen.
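
The seeding procedure can be sketched as follows, taking each point's contribution to the error to be its squared distance to the nearest center chosen so far. The function name `kmeans_pp_init` and the `seed` parameter are assumptions for this example, and the sketch assumes the data contains at least k distinct points.

```python
import random


def kmeans_pp_init(points, k, seed=0):
    """Choose k initial centers from points with k-means++ style seeding."""
    rng = random.Random(seed)
    # First center: uniformly at random from the data.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight of each point: squared distance to its nearest chosen center.
        weights = [
            min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
            for p in points
        ]
        # Next center: drawn with probability proportional to that weight,
        # so points far from every existing center are favored.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


data = [(0, 0), (5, 5), (10, 0)]
initial_centers = kmeans_pp_init(data, k=3)
```

A point that is already a center has weight 0 and cannot be drawn again, so with three distinct points and k = 3 every point is selected exactly once. The returned centers would then replace the purely random initialization in the first step of K-means.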
