Missing Values

A missing value occurs when no value is stored for a feature in a given observation (row).
Missing values are problematic because most ML algorithms can't handle nulls. Therefore, either the affected rows or columns are removed, which reduces the amount of training data, or the missing entries are replaced with sensible, meaningful values, which is called imputation.
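
The trade-off between removal and imputation can be sketched with pandas. This is a minimal illustration on a made-up table (the column names and values are hypothetical), not a recommended pipeline; median imputation is just one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; np.nan / None mark missing values.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 47.0],
    "income": [50000.0, 62000.0, np.nan, 58000.0],
    "city": ["Paris", "Rome", "Rome", None],
})

# Count the missing values in each column.
missing_per_column = df.isna().sum()

# Option 1: remove every row that has at least one missing value.
dropped = df.dropna()

# Option 2: impute the numeric columns with their median (rows are kept).
numeric_cols = ["age", "income"]
imputed = df.copy()
imputed[numeric_cols] = imputed[numeric_cols].fillna(df[numeric_cols].median())

print(missing_per_column)
print("rows before:", len(df), "after dropna:", len(dropped),
      "after imputation:", len(imputed))
```

Here removal keeps only the single fully observed row, while imputation keeps all four rows but leaves the categorical column untouched.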

Info

Missing values cause two problems:

  • Some algorithms can't work with missing values.
  • Models trained with missing values can reach inaccurate conclusions.

Understand the effects of missing values:

  • Causes of missing values:
    • Some measurements may be missing
    • Lack of information
    • Transcription errors
  • Data source
  • Effect of the features that have missing values
  • Meaning that the missing values convey

Types of missing values:

  • Missing At Random (MAR): There is a pattern in the missing data, but it depends only on other observed variables, not on the missing value itself.
  • Missing Completely At Random (MCAR): There is no pattern in the missing data; missingness is unrelated to any variable, observed or unobserved.
  • Not Missing At Random (NMAR): There is a pattern in the missing data that depends on unobserved information, such as predictors that were not recorded or the missing value itself (for example, a likelihood-to-recommend rating that is skipped more often when it would be low).

Tip

The mechanism by which missing fields are introduced into a dataset can help us choose the best way to handle them. Business understanding or statistical tests can help us decide which mechanism to assume.
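
To make the three mechanisms concrete, the sketch below simulates each one with NumPy on synthetic data and applies a very simple diagnostic: comparing an observed variable between rows with and without a missing value. The variables (`age`, `income`), the missingness probabilities, and the helper `mean_age_gap` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical survey: age is always observed, income may be missing.
age = rng.integers(18, 80, size=n)
income = 20_000 + 500 * age + rng.normal(0, 5_000, size=n)

# MCAR: every income value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: the chance of a missing income depends only on the observed age.
mar_mask = rng.random(n) < np.where(age > 50, 0.4, 0.05)

# NMAR: the chance of a missing income depends on the income value itself.
nmar_mask = rng.random(n) < np.where(income > np.median(income), 0.4, 0.05)


def mean_age_gap(mask):
    """Mean age where income is missing minus mean age where it is present."""
    return age[mask].mean() - age[~mask].mean()


for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("NMAR", nmar_mask)]:
    print(f"{name}: age gap = {mean_age_gap(mask):6.2f}, "
          f"observed income mean = {income[~mask].mean():9.0f} "
          f"(true mean {income.mean():.0f})")
```

A gap close to zero is consistent with MCAR, while a clear gap argues against it. The check cannot separate MAR from NMAR, because that distinction depends on values that were never observed; this is where business understanding comes in.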


Solutions:

  • Remove data items (rows) or features (columns)
    • List-wise deletion: if MAR, NMAR, MCAR. ℹ️ In the case of MCAR it preserves the distribution; under MAR or NMAR the remaining rows can be biased.
    • Pair-wise deletion: if MCAR
  • Imputation (Fill missing values)
    • Average Imputation: Substitute with mean/median/mode if MCAR.
    • Common-Point Imputation: For a rating scale, use the middle point or the most commonly chosen value (for categorical variables).
    • Expectation-Maximization (EM): if MAR, MCAR
    • Maximum Likelihood: if MAR, MCAR
    • Regression Imputation: if MAR, MCAR
    • Hot-deck imputation: if MCAR
      • ℹ️ Fills missing values using observed values from the sample. If the donor observations are selected randomly, the technique is called random hot-deck imputation.
    • Substitute the missing values with zero or NA
  • Leave as is
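
Several of the solutions above can be sketched in a few lines of pandas. The data frame and the helper `random_hot_deck` are hypothetical and only illustrate the ideas; the model-based methods (Expectation-Maximization, maximum likelihood, regression imputation) are not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing entries in every column.
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0],
    "y": [2.0, np.nan, 6.0, 8.0, 10.0],
    "z": [5.0, 3.0, 4.0, np.nan, 1.0],
    "rating": ["good", "good", None, "bad", "good"],
})
numeric = df[["x", "y", "z"]]

# List-wise deletion: keep only the rows with no missing value at all.
listwise = df.dropna()

# Pair-wise deletion: DataFrame.corr excludes missing values pair by pair,
# so each coefficient uses all rows where both columns are present.
pairwise_corr = numeric.corr()

# Average imputation: column mean for numeric features,
# most frequent value (mode) for a categorical feature.
mean_imputed = numeric.fillna(numeric.mean())
mode_imputed = df["rating"].fillna(df["rating"].mode()[0])


def random_hot_deck(series, rng):
    """Fill each missing entry with a randomly drawn observed value."""
    filled = series.copy()
    donors = series.dropna().to_numpy()
    missing = series.isna()
    filled.loc[missing] = rng.choice(donors, size=int(missing.sum()))
    return filled


rng = np.random.default_rng(42)
hot_deck = numeric.apply(lambda col: random_hot_deck(col, rng))

print(len(listwise), "rows survive list-wise deletion")
print(pairwise_corr.round(2))
print(mean_imputed)
print(hot_deck)
```

List-wise deletion discards three of the five rows here, whereas the pair-wise correlation between `x` and `y` still uses the three rows in which both are observed.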