Data Journey: The process of change in data over its lifecycle. Tracking it is necessary to recreate, compare, or explain ML models.
Data provenance (data lineage): the tracking of the series of transformations that data and models undergo in their evolution from raw inputs to output artifacts.
Data Versioning: Data reflects only a snapshot of the world at the time it was gathered, and it is expected to change over time. It is therefore vital to version our data along with the code and runtime parameters we typically track.
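One lightweight way to pin a data version is to log a content hash of the dataset alongside the code commit and run parameters. A minimal sketch using only the standard library (the helper name and sample records are illustrative; tools such as DVC apply the same idea at file granularity):

```python
import hashlib

def dataset_fingerprint(records):
    """Return a stable hash over an ordered iterable of data records.

    The hash changes whenever any record changes, so it can be logged
    next to the code commit and hyperparameters of a training run.
    """
    digest = hashlib.sha256()
    for record in records:
        digest.update(repr(record).encode("utf-8"))
    return digest.hexdigest()

# Two identical snapshots hash the same; a changed record does not.
v1 = dataset_fingerprint([("user1", 0.4), ("user2", 0.9)])
v2 = dataset_fingerprint([("user1", 0.4), ("user2", 0.9)])
v3 = dataset_fingerprint([("user1", 0.5), ("user2", 0.9)])
```

Note that the fingerprint is order-sensitive, so records should be iterated in a canonical order before hashing.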
Feature Stores (Feature Repositories): a central repository of documented, curated, and access-controlled data features that teams can share, discover, and use for model training and serving. A feature store reduces redundant feature engineering and provides a unified, consistent, and persistent means of managing features that is performant and scalable. It also helps ensure that training-serving skew is avoided.
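The training-serving consistency point can be illustrated with a toy in-memory store; production systems add storage backends, access control, discovery, and point-in-time lookups. All names below are illustrative:

```python
class ToyFeatureStore:
    """Minimal in-memory feature store: one definition of each
    feature, shared by the training and serving code paths."""

    def __init__(self):
        self._features = {}  # feature name -> {entity id -> value}

    def register(self, name, values):
        self._features[name] = dict(values)

    def get_vector(self, entity_id, feature_names):
        # Training and serving call this same method, so the feature
        # computation cannot silently drift between the two paths.
        return [self._features[f][entity_id] for f in feature_names]

store = ToyFeatureStore()
store.register("avg_spend", {"u1": 42.0, "u2": 7.5})
store.register("num_visits", {"u1": 3, "u2": 11})

training_row = store.get_vector("u1", ["avg_spend", "num_visits"])
serving_row = store.get_vector("u1", ["avg_spend", "num_visits"])
```

Because both paths read the same curated definition, the skew that comes from reimplementing features in serving code is avoided by construction.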
A database is an organized collection of data that allows easy access and retrieval.
A data warehouse is a central repository of information designed for analysis to drive informed decisions.
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.
Model processing: building the model (creation, training, evaluation, and testing) while managing resources for optimal model performance.
Model-dependent monitoring (e.g., Margin Density Drift Detection).
Data parallelism replicates the model onto different accelerators (GPUs or TPUs) and splits the data between them.
Model parallelism divides a large model (too big to fit on a single device) into partitions and assigns them to different accelerators.
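Data parallelism can be sketched without any accelerator framework: each replica holds the same weights, computes gradients on its own data shard, and the gradients are averaged (the "all-reduce" step). A pure-NumPy sketch on a linear model, with all names illustrative:

```python
import numpy as np

def local_gradient(w, X, y):
    """Mean-squared-error gradient for a linear model on one shard."""
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = np.zeros(3)
# "Replicate" the model: each shard sees the same weights w and
# computes a gradient on its equal-sized slice of the batch.
shards = [(X[:4], y[:4]), (X[4:], y[4:])]
grads = [local_gradient(w, Xs, ys) for Xs, ys in shards]
# All-reduce step: average the per-replica gradients.
avg_grad = np.mean(grads, axis=0)
# With equal shard sizes, this matches the full-batch gradient.
full_grad = local_gradient(w, X, y)
```

Model parallelism, by contrast, would assign different layers (or partitions of a layer) of one model to different devices rather than copying the whole model to each.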
High-Performance Ingestion: accelerators (GPUs/TPUs) are vital for high-performance modeling, but they are expensive and must be used efficiently. Efficiency is maintained by supplying accelerators with data fast enough that they do not sit idle, which improves training time. Approaches:
Parallelization of data extraction and transformation
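The parallelization idea can be sketched with Python's standard library: extraction and transformation run across worker threads so prepared batches are ready while the accelerator is consuming earlier ones. The I/O delay and the transform below are simulated stand-ins (in TensorFlow, `tf.data` with `num_parallel_calls` and `prefetch` plays this role):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def extract(path):
    """Simulated I/O-bound read of one data file."""
    time.sleep(0.05)  # stand-in for disk or network latency
    return [len(path)] * 4

def transform(record):
    """Simulated per-record preprocessing."""
    return [x * 2 for x in record]

paths = [f"shard_{i}.dat" for i in range(8)]

# Parallel extraction and transformation across worker threads.
start = time.time()
with ThreadPoolExecutor(max_workers=8) as pool:
    raw = list(pool.map(extract, paths))
    batches = list(pool.map(transform, raw))
parallel_time = time.time() - start

# Serial baseline: each file is read and processed one after another.
start = time.time()
serial = [transform(extract(p)) for p in paths]
serial_time = time.time() - start
```

With eight 50 ms reads, the serial path pays the latency eight times while the parallel path overlaps them, so the same batches arrive much sooner.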
Knowledge Distillation: the idea behind knowledge distillation is to train a simple 'student' model to learn from a more complex 'teacher' model. The goal is to duplicate the performance of the complex model in a simpler, more efficient one.
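The core of distillation is a loss that pushes the student's softened output distribution toward the teacher's. A minimal NumPy sketch of that loss term (the temperature value and logits are illustrative; in practice this is combined with an ordinary cross-entropy loss on the true labels):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    """KL divergence between softened teacher and student outputs.

    A higher temperature flattens the distributions, exposing the
    teacher's relative probabilities for the non-target classes,
    which is the extra signal the student learns from.
    """
    p = softmax(teacher_logits, temperature)  # soft teacher targets
    q = softmax(student_logits, temperature)
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = np.array([8.0, 2.0, 1.0])
matched = distillation_loss(teacher, np.array([8.0, 2.0, 1.0]))
mismatched = distillation_loss(teacher, np.array([1.0, 2.0, 8.0]))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge.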