Clustering is the task of grouping unlabeled data items into clusters based on how similar the items are to one another.

Clustering Methods

  • Centroid-based Clustering: A non-hierarchical clustering method in which a centroid is defined for each of a specified number of clusters, and each data item is assigned to the cluster whose centroid is nearest.
    • Algorithms
  • Distribution-based Clustering: This method assumes the data is generated by a mixture of probability distributions; the farther an item lies from a distribution's center, the lower the probability that it belongs to that distribution.
  • Density Method: It identifies areas with a high concentration of data points and groups the points in each area together, assuming that they are more similar to one another than to points in lower-density regions.
    This method can take advantage of Kernel Density Estimation (KDE), which estimates the Probability Density Function (PDF) of the data, to approximate the underlying distribution.
    • ✔️ Has good accuracy
    • ✔️ Can merge clusters
    • ✔️ Can form arbitrarily shaped clusters in dense areas
    • ✔️ Can identify outliers
    • ❌ Performs poorly on high-dimensional data
    • Algorithms
  • Hierarchical Method: It first forms clusters in a tree-like structure, then creates new clusters from the previously formed ones.
    • Algorithms
      • BIRCH
      • CURE
      • Agglomerative hierarchical clustering
  • Partitioning Method: It divides the objects into k partitions, each of which forms one cluster.
    • Algorithms
      • CLARANS
  • Grid-based Method: It divides the data space into a finite number of cells that form a grid-like structure.
    • Algorithms
      • CLIQUE
      • STING
  • Graph-Based Methods: These use graph theory, treating data items as nodes and the edges connecting them as a measure of similarity.
    • Algorithms
      • Spectral Clustering
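
As a rough illustration of the centroid-based approach described in the list above, the sketch below implements a basic k-means loop (Lloyd's algorithm) using only the Python standard library. The function name `kmeans`, its parameters, and the sample points are made up for this example, and seeding the centroids with the first k points is a simplifying assumption; practical implementations usually choose the starting centroids at random.

```python
import math


def kmeans(points, k, max_iterations=100):
    """Group `points` (tuples of floats) into k clusters with Lloyd's algorithm."""
    # Assumption: seed the centroids with the first k points to keep the
    # sketch deterministic; practical code usually picks them at random.
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(max_iterations):
        # Assignment step: label each point with the index of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # leave an empty cluster's centroid in place
        if updated == centroids:
            break  # centroids stopped moving, so the assignment is stable
        centroids = updated
    return labels, centroids


# Hypothetical sample data: two small, well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

On this sample, the first three points end up sharing one label and the last three share the other. The result of k-means depends on the starting centroids, so less clearly separated data may settle into a different grouping.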

Resources: