Multicollinearity

Regression models assume that features are not strongly correlated with one another (no collinearity), so that the effect of a change in one feature on the target variable can be estimated separately from the effects of the other features.

Info

The main problem with multicollinearity is that correlated features add little new information to the model while still increasing its dimensionality (the Curse of Dimensionality).


Ways to detect and mitigate multicollinearity:

  • A Correlation Matrix detects collinearity between pairs of features.
  • The Variance Inflation Factor (VIF) is often used to identify multicollinearity, including cases that involve more than two features.
  • Regularization techniques can shrink the coefficients of multicollinear features. Lasso Regression is often used for this purpose because its L1 Regularization can drive some coefficients exactly to zero.
  • Principal Components Analysis (PCA) replaces correlated features with a smaller set of uncorrelated components.
  • Hierarchical Clustering can handle multicollinearity by clustering features on their Spearman rank-order correlations, cutting the hierarchy at a threshold, and keeping a single feature from each cluster.
  • One-Hot Encoding turns each categorical feature into a set of indicator columns, which can then be removed selectively (for example, dropping one column per feature, since the full set is perfectly collinear with the intercept).
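
The first two detection methods in the list can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the toy columns `x1`, `x2`, `x3` and all helper names are hypothetical, and the VIF is computed through the identity that the VIF of feature j equals the j-th diagonal entry of the inverse correlation matrix. In practice a statistics library would normally be used instead.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a list of feature columns."""
    return [[pearson(a, b) for b in columns] for a in columns]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity; the right half becomes the inverse.
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: features are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF of each feature: the diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]


# Hypothetical toy data: x2 is roughly 2 * x1, x3 is nearly unrelated to both.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 4.1, 5.8, 7.8, 10.1, 12.1]
x3 = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0]

corr = correlation_matrix([x1, x2, x3])
vifs = vif([x1, x2, x3])

for name, value in zip(["x1", "x2", "x3"], vifs):
    print(f"VIF({name}) = {value:.2f}")
```

A VIF near 1 suggests a feature is not explained by the others, while values above roughly 5 to 10 are a common rule-of-thumb signal of multicollinearity. Here `x1` and `x2` get very large VIFs and `x3` stays close to 1.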
Tip

Adding more data to the training set can alleviate the problems caused by multicollinearity, but it has its own challenges.


Note:

  • Multicollinearity is a common problem in Regression Models: it makes coefficient estimates unstable and hard to interpret, and the redundant features increase model complexity and the risk of Overfitting, which can make predictions less accurate.
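
The hierarchical clustering approach listed among the solutions can be illustrated with a small standard-library sketch. Several things here are assumptions made for illustration: the feature names and data are hypothetical, the distance between two features is taken as 1 - |Spearman rho|, and single linkage is used because cutting a single-linkage hierarchy at a threshold gives the same groups as joining every pair of features closer than that threshold. Other linkage methods (for example Ward) are also common and may give different clusters.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def cluster_features(features, threshold):
    """Single-linkage clusters of features cut at distance `threshold`."""
    names = list(features)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster root, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            distance = 1 - abs(spearman(features[a], features[b]))
            if distance <= threshold:
                parent[find(b)] = find(a)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return list(clusters.values())


# Hypothetical toy data: x2 increases monotonically with x1, x3 does not.
features = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.1, 4.1, 5.8, 7.8, 10.1, 12.1],
    "x3": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
}

clusters = cluster_features(features, threshold=0.3)
# Keep the first feature of each cluster as its representative.
selected = [cluster[0] for cluster in clusters]
print(clusters, selected)
```

Keeping the first feature of each cluster is an arbitrary choice made for brevity; a representative could equally be chosen by domain knowledge or by its correlation with the target.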