High Cardinality

A variable that contains a large number of distinct labels is said to have high cardinality: it takes too many unique values.

In other words, a variable with too many labels increases the uniqueness of the data.
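
As a rough illustration, cardinality can be measured by counting the distinct labels in a column and comparing that count to the number of rows. The sketch below uses only the Python standard library; the toy dataset and the helper names `cardinality` and `uniqueness_ratio` are assumptions made for this example, not part of any library.

```python
# Hypothetical toy dataset: each row is (city, zip_code).
rows = [
    ("Paris", "75001"),
    ("Paris", "75002"),
    ("Lyon", "69001"),
    ("Paris", "75003"),
    ("Lyon", "69002"),
    ("Paris", "75004"),
]


def cardinality(values):
    """Return the number of distinct labels in a column."""
    return len(set(values))


def uniqueness_ratio(values):
    """Distinct labels divided by row count; 1.0 means every row is unique."""
    values = list(values)
    return cardinality(values) / len(values)


cities = [city for city, _ in rows]
zips = [zip_code for _, zip_code in rows]

# Low cardinality: few labels shared by many rows.
print("city:", cardinality(cities), uniqueness_ratio(cities))
# High cardinality: every row carries its own label.
print("zip: ", cardinality(zips), uniqueness_ratio(zips))
```

A ratio close to 1.0 suggests that almost every row has its own label, which is the situation described above.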


Problems caused by high cardinality:

  • Variables with high cardinality can dominate the other variables in the dataset, negatively affecting model performance. This is especially common in tree-based models, where a variable with many distinct values offers many candidate splits.
  • A high number of unique values can act as noise: labels that occur only a few times add little reliable information to the dataset and make the model prone to overfitting.
  • Some labels may be present in only the training set or only the test set, so the model can encounter labels at prediction time that it never saw during training.
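
The last two problems can be checked directly on the data. The following is a minimal sketch, assuming the labels of one variable are available as plain Python lists; the sample lists, the helper names `unseen_labels` and `rare_labels`, and the `min_count` threshold are hypothetical choices for illustration.

```python
from collections import Counter

# Hypothetical labels of one variable in the training and test sets.
train_labels = ["red", "green", "blue", "green", "red"]
test_labels = ["green", "purple", "red", "orange"]


def unseen_labels(train, test):
    """Labels that occur in the test set but never in the training set."""
    return set(test) - set(train)


def rare_labels(values, min_count=2):
    """Labels that occur fewer than min_count times (likely to act as noise)."""
    counts = Counter(values)
    return {label for label, n in counts.items() if n < min_count}


unseen = unseen_labels(train_labels, test_labels)
rare = rare_labels(train_labels)

print("unseen in training:", sorted(unseen))
print("rare in training:  ", sorted(rare))
```

Here `purple` and `orange` would reach the model only at test time, and `blue` appears just once in training, so there is very little evidence to learn from for any of them.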

Solutions to high cardinality: