tags: [AI/Tasks/Clustering, AI/Algorithms/K-Means]

K-Means

K-Means is an Unsupervised Learning algorithm for Clustering which groups unlabeled data into different clusters( is an integer over 2 and is defined by user).

K-Means is trained on a dataset of items and and after number of cluster centroids are randomly assigned as , through an iterative process creates cohesive clusters in such a way that data points with similar attributes belong to the same cluster.

Info

K-Means assumes that the closes the data points are, the more similar they are. Often Euclidean distance is used to measure this distance.

Algorithm:

Select number of clusters as .
Initialize cluster centroids as
Repeat until convergence(When centroids don’t move in last two iteration), or a maximum number of iterations has been reached:
1. Measure distance between points and centroids. for every set:
2. Assign each data item( ) to closest cluster centroid().
3. Identify center points of each cluster. for every set:
  
  ℹ️ This will move each cluster centroid() to the center of clusters assigned to it

Advantages:

It is efficient and also fast and easy to train.
It’s relatively simple and easy to implement, scale, and adopt to new changes.
It guarantees convergence
It’s good at generalizing clusters of different sizes and round shapes such as elliptical clusters.

Disadvantages:

This algorithm does not find the optimal number of clusters.
It’s sensitive to initial conditions(Randomized centroids) so the results of the clustering are not consistent and can produce different results each time a dataset is trained. However initial positions of centroids can be manually declared.
K-Means is not good at handling oddly shaped data distributions as it shapes clusters to circular forms.
- Gaussian mixture model (GMM) algorithm is proposed as an alternative to K-Means to handle non-circular data.
- An example of oddly shaped data distribution:
It can’t handle data of varying sizes, density, and dimensions.
It’s not good at dealing with outliers and noise.

Applications:

Image Segmentation
Social Network Analysis
[[Customer Segmentation
Document Classification
Outlier Detection
Feature learning

To identify optimum number of clusters following methods are used:

Domain Knowledge
Elbow Curve: Utilizes WCSS (Within-Cluster Sum of Square) values(on the y-axis) corresponding to the different values of K(on the x-axis) to draw a graph and present an elbow shape in the graph which indicates the K-value where the elbow gets created.
Silhouette Method
Calculated as:
Where:
- : Average intra-cluster distance(the average distance between each point within a cluster).
- : average inter-cluster distance(the average distance between all clusters).
  The value of the Silhouette score ranges from -1 to 1:
- 1: Points are perfectly assigned in a cluster and clusters are easily distinguishable.
- 0: Clusters are overlapping.
- -1: Points are wrongly assigned in a cluster.

Resources:

Andrew Ng, CS229 Lecture Notes, K-Means

Code:
learning-repo/k_means at main · JacobBumgarner/learning-repo