Data Quality

Data quality is a measure of how useful data is to its end users or to the software that processes it. Data is considered high quality if it is fit for its intended use: it has the desired features and is free of defects.

The quality of a model's predictions depends directly on the quality of the data you train it with.

  • Data quality usually matters more than data quantity.
  • Improving data quality often works better than improving the model.

Traits of high-quality data:

  • Free from biases
  • No leaked features
  • Independent samples
  • High predictive power
  • No duplicated samples
  • Train & test sets have the same probability distributions
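
Some of these traits can be checked mechanically. The following is a minimal sketch, not a definitive implementation: it uses only the Python standard library, and the function names (`duplicate_rows`, `train_test_overlap`, `mean_shift`) and the toy data are assumptions made for illustration. It looks for duplicated samples, for rows shared between the train and test sets, and for a shift in the mean of one feature between the two sets. The mean-shift score is a crude proxy for comparing distributions, not a proper two-sample test.

```python
from statistics import mean, stdev

def duplicate_rows(rows):
    """Return each row that appears more than once (reported once)."""
    seen, dupes = set(), []
    for row in rows:
        key = tuple(row)
        if key in seen and key not in dupes:
            dupes.append(key)
        seen.add(key)
    return dupes

def train_test_overlap(train, test):
    """Return rows present in both splits, which breaks their independence."""
    return sorted(set(map(tuple, train)) & set(map(tuple, test)))

def mean_shift(train_values, test_values):
    """Gap between the two means, in units of the training standard deviation."""
    return abs(mean(train_values) - mean(test_values)) / stdev(train_values)

# Toy data (hypothetical): each row is (feature, label).
train = [(1.0, 0), (2.0, 1), (2.0, 1), (3.0, 0)]
test = [(2.0, 1), (4.0, 1)]

print(duplicate_rows(train))
print(train_test_overlap(train, test))
print(mean_shift([r[0] for r in train], [r[0] for r in test]))
```

A large mean-shift score suggests the two sets may not share the same distribution, but a small one does not prove that they do.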

Solving Data Quality Problems:

  • Data Profiling: Data profiling analyzes data sets to identify patterns, outliers, and other key characteristics, and to show which variables are most predictive of a given outcome. Profiles also reveal which variables contain missing values, so that nulls and other gaps can be filled in.
  • Data Contracts: A contract is an agreement between two parties, so a data contract is an agreement between the developers who produce the data and the customers who consume it. They agree on the values that are sent and received as part of the data.
  • Anomaly Detection: An anomaly is any event that deviates from normal behavior. In other words, it is an observation that does not conform to the expected distribution or pattern of the data.
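
The three approaches above can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions rather than a production tool: the field names (`user_id`, `amount`), the `CONTRACT` mapping, the helper names, and the z-score threshold of 2.0 are all hypothetical. A z-score rule is only one simple way to flag anomalies.

```python
from statistics import mean, stdev

def profile(records, field):
    """Data profiling: missing count plus basic statistics for one numeric field."""
    values = [r.get(field) for r in records]
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
    }

# Hypothetical data contract: field name -> required Python type.
CONTRACT = {"user_id": int, "amount": float}

def contract_violations(record, contract=CONTRACT):
    """Data contract: list the fields that are absent or have the wrong type."""
    return [
        name for name, expected in contract.items()
        if not isinstance(record.get(name), expected)
    ]

def zscore_anomalies(values, threshold=2.0):
    """Anomaly detection: values more than `threshold` std devs from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

# Toy records (hypothetical).
records = [
    {"user_id": 1, "amount": 10.0},
    {"user_id": 2, "amount": 12.0},
    {"user_id": 3, "amount": None},   # missing value
    {"user_id": 4, "amount": 11.0},
    {"user_id": "5", "amount": 9.0},  # wrong type for user_id
]

print(profile(records, "amount"))
print([contract_violations(r) for r in records])
print(zscore_anomalies([10, 11, 9, 10, 12, 10, 11, 9, 10, 50]))
```

In practice the profile would typically be computed first, since it shows which fields and ranges a contract should cover and what "normal" looks like for the anomaly check.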