Data Cleaning

Data cleaning involves checking data for errors and inconsistencies that can lower its quality and reliability, then correcting them.

Tasks:

  • Tidy data: make sure each variable is in its own column and each observation is in its own row.
  • Check for and remove duplicate values
  • Handle missing values
  • Clean noisy data
    • Binning: group a continuous variable that has many infrequently occurring values into a small number of intervals (bins) of similar values. The result is a new categorical feature, which can help reduce overfitting.
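    • Regression: fit a function to the data and use the fitted values to smooth out noise.
    • Clustering: group similar values together; values that fall outside every cluster can be treated as outliers.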
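
The first four tasks above can be sketched with pandas as follows. This is only an illustration under stated assumptions: the table, its column names (city, year, temp_c, temp_band), the choice of the median to fill the missing value, and the three equal-width bins are all made up for the example, not prescribed by these notes.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one column per year, a repeated "Lima" row,
# and one missing measurement.
raw = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", "Lima", "Pune"],
        "2020": [5.0, 19.0, 19.0, 25.0],
        "2021": [6.0, 20.0, 20.0, np.nan],
    }
)

# Tidy data: one variable per column, one observation per row.
tidy = raw.melt(id_vars="city", var_name="year", value_name="temp_c")

# Check for and remove duplicate rows.
n_duplicates = int(tidy.duplicated().sum())
deduped = tidy.drop_duplicates().reset_index(drop=True)

# Handle missing values: fill with the column median (an assumed strategy;
# dropping the rows or using another statistic are equally valid options).
cleaned = deduped.copy()
cleaned["temp_c"] = cleaned["temp_c"].fillna(cleaned["temp_c"].median())

# Binning: turn the continuous column into three equal-width categories.
cleaned["temp_band"] = pd.cut(
    cleaned["temp_c"], bins=3, labels=["low", "mid", "high"]
)

print(cleaned)
```

Which fill strategy and how many bins to use depend on the data set; the values here were picked only to keep the example small.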

Data Cleaning With pandas and NumPy - Real Python