K-Means

K-Means is an Unsupervised Machine Learning algorithm for Clustering which groups unlabeled data into different clusters( is an integer over 2 and is defined by user).

K-Means is trained on a dataset of items and and after number of cluster centroids are randomly assigned as , through an iterative process creates cohesive clusters in such a way that data points with similar attributes belong to the same cluster.

Info

K-Means assumes that the closes the data points are, the more similar they are. Often Euclidean distance is used to measure this distance.


Algorithm:

  1. Select number of clusters as .

  2. Initialize cluster centroids as

  3. Repeat until convergence(When centroids don’t move in last two iteration), or a maximum number of iterations has been reached:

    1. Measure distance between points and centroids. for every set:

    2. Assign each data item( ) to closest cluster centroid().

    3. Identify center points of each cluster. for every set:

      ℹ️ This will move each cluster centroid() to the center of clusters assigned to it


Advantages:

  • It is efficient and also fast and easy to train.
  • It’s relatively simple and easy to implement, scale, and adopt to new changes.
  • It guarantees convergence
  • It’s good at generalizing clusters of different sizes and round shapes such as elliptical clusters.

Disadvantages:

  • This algorithm does not find the optimal number of clusters.
  • It’s sensitive to initial conditions(Randomized centroids) so the results of the clustering are not consistent and can produce different results each time a dataset is trained. However initial positions of centroids can be manually declared.
  • K-Means is not good at handling oddly shaped data distributions as it shapes clusters to circular forms.
    • Gaussian Mixture Model (GMM) algorithm is proposed as an alternative to K-Means to handle non-circular data.
    • An example of oddly shaped data distribution:oddly-shaped-data.jpg
  • It can’t handle data of varying sizes, density, and dimensions.
  • It’s not good at dealing with outliers and noise.

Applications:


To identify optimum number of clusters following methods are used:

  • Domain Knowledge
  • Elbow Curve: Utilizes WCSS (Within-Cluster Sum of Square) values(on the y-axis) corresponding to the different values of K(on the x-axis) to draw a graph and present an elbow shape in the graph which indicates the K-value where the elbow gets created.
  • Silhouette Method
    Calculated as:
    Where:
    • : Average intra-cluster distance(the average distance between each point within a cluster).
    • : average inter-cluster distance(the average distance between all clusters).
      The value of the Silhouette score ranges from -1 to 1:
    • 1: Points are perfectly assigned in a cluster and clusters are easily distinguishable.
    • 0: Clusters are overlapping.
    • -1: Points are wrongly assigned in a cluster.

Resources:

Code:
learning-repo/k_means at main · JacobBumgarner/learning-repo