Understand, collect, and explore data needed to solve the problem
Scoping
Define problem: Identifying real world problems
Brainstorm AI solutions
Assess feasibility of potential solutions:
Evaluate human level performance
Evaluate competitor or benchmark results
Explore available data and features for prediction
Determine milestones
Budget for resources
Data Collection and storage: often a Data Pipeline is used to gather and prepare data for Machine Learning tasks. Data size and type can impact both model development and operation.
Data Journey: Process of change in data. Tracking it is necessary to recreate, compare, or explain ML models.
Data provenance (data lineage): the tracking of the series of transformations in the evolution of data and models, from raw input to output artifacts.
Data Versioning: Data only reflects a snapshot of the world when the data was gathered. Data is expected to change over time. It is vital to version our data along with the code and runtime parameters we typically track.
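A minimal sketch of the idea, assuming a dataset small enough to serialize: fingerprint each data snapshot together with the run parameters so it can be versioned alongside the code (the function and field names are illustrative, not a specific tool's API):

```python
import hashlib
import json

def dataset_version(records, params):
    """Return a short content hash identifying this data snapshot + run config."""
    # sort_keys makes the hash deterministic regardless of dict insertion order
    payload = json.dumps({"data": records, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

v1 = dataset_version([{"x": 1, "y": 0}], {"lr": 0.01})
v2 = dataset_version([{"x": 1, "y": 0}, {"x": 2, "y": 1}], {"lr": 0.01})
```

Any change to the data or the parameters yields a new version identifier, while re-hashing an unchanged snapshot reproduces the old one.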
Feature Stores (Feature Repositories): a central repository of documented, curated, and access-controlled data features that teams can share, discover, and use for model training and serving. It reduces redundancy and provides a unified, consistent, and persistent means of managing data features that is performant and scalable. It also helps ensure that training-serving skew is avoided.
Data Storage:
A database is an organized collection of data that allows easy access and retrieval.
A data warehouse is a central repository of information designed for analysis to drive informed decisions.
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.
Model processing: Building the model (creation, training, evaluation, and testing) while managing resources for optimal model performance.
Automation: it can occur at several levels: manual, partial automation with a human in the loop, or full automation.
Model Analysis: After training and deploying a model, the next phase is to evaluate its performance.
Model Training
DevOps Monitoring: monitoring available resource consumption. E.g. CPU, GPU, memory, network requests and data transfer, etc.
Continuous Evaluation and Monitoring: It is essential to continuously monitor data and model performance by performing Model Evaluation to get early warnings.
Model Monitoring: define several metrics that can signal that something has gone wrong with the model.
Input metrics: Null value input, wrong type, out of range input, anomaly, etc.
Output metrics: Null output, low confidence in new tests, etc.
Software metrics: Latency, server load, etc.
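The input metrics above can be sketched as simple validation checks on each incoming record; the schema format and field names here are illustrative assumptions, not a specific monitoring library:

```python
def check_input(record, schema):
    """Return monitoring violations (null, wrong type, out of range) for one record.
    `schema` maps field -> (expected type, min, max); the format is illustrative."""
    issues = []
    for field, (ftype, lo, hi) in schema.items():
        value = record.get(field)
        if value is None:
            issues.append(f"{field}: null value")
        elif not isinstance(value, ftype):
            issues.append(f"{field}: wrong type")
        elif not lo <= value <= hi:
            issues.append(f"{field}: out of range")
    return issues

SCHEMA = {"age": (int, 0, 120), "income": (float, 0.0, 1e7)}
```

In production these counts would be aggregated over time windows and alerted on, rather than inspected per record.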
Problems:
concept drift
concept emergence
covariate shift
prior probability shift
Solutions:
Supervised techniques
Statistical process control
Sequential analysis (e.g., Linear Four Rates)
Error distribution monitoring (adaptive windowing)
Model-dependent monitoring (e.g., Margin Density Drift Detection).
Concerns:
High-Performance Modeling:
Distributed Training:
Data parallelism replicates the model onto different accelerators (GPUs or TPUs) and splits the data between them.
Model parallelism divides a large model (too big to fit on a single device) into partitions and assigns them to different accelerators.
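Data parallelism can be sketched in plain Python by simulating replicas that each compute gradients on their own shard and then averaging them (a stand-in for an all-reduce); the 1-D linear model and learning rate are purely illustrative:

```python
def grad_mse(w, shard):
    """Gradient of mean squared error for a 1-D linear model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, batch, n_replicas, lr=0.1):
    """One training step: shard the batch, compute per-replica gradients,
    average them (simulated all-reduce), and apply a single shared update."""
    shards = [batch[i::n_replicas] for i in range(n_replicas)]
    grads = [grad_mse(w, s) for s in shards]   # would run in parallel on devices
    avg = sum(grads) / len(grads)
    return w - lr * avg

batch = [(1, 2), (2, 4), (3, 6), (4, 8)]   # y = 2x
w = 0.0
for _ in range(20):
    w = data_parallel_step(w, batch, n_replicas=2)
```

With equal-sized shards, the averaged shard gradients equal the full-batch gradient, which is why each replica ends the step with the same weights.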
High-Performance Ingestion: Accelerators (GPUs/TPUs) are vital for high-performance modeling, but they are expensive and must be used efficiently. That means supplying them with data fast enough that they never sit idle, which also improves training time. Approaches:
Prefetching
Caching
Memory reduction
Parallelization of data extraction and transformation
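Prefetching, for example, can be sketched with a background producer thread that keeps a small buffer full so the consumer (the accelerator) never waits on extraction or transformation; this is a toy stand-in for what frameworks such as tf.data provide:

```python
import queue
import threading

def prefetch(generator, buffer_size=4):
    """Yield items from `generator`, producing them ahead of time in a
    background thread so the consumer is not blocked by a slow pipeline."""
    q = queue.Queue(maxsize=buffer_size)
    done = object()   # sentinel marking the end of the stream

    def producer():
        for item in generator:
            q.put(item)   # blocks when the buffer is full (backpressure)
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item
```

The bounded queue gives backpressure: the producer stays at most `buffer_size` items ahead, overlapping data preparation with consumption.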
Knowledge Distillation: The idea behind knowledge distillation is to create a simple 'student' model that learns from a more complex 'teacher' model. The goal is to transfer the performance of the complex model into a simpler, more efficient one.
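The core of the training objective can be sketched as the cross-entropy between temperature-softened teacher and student output distributions; the temperature value and logits are illustrative, and real setups usually also mix in the hard-label loss:

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher T softens the distribution."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher targets and student outputs."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))
```

The loss is minimized when the student's softened distribution matches the teacher's, so matching logits score lower than mismatched ones.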
Data-centric: focuses on high-quality data; it is more practical in business applications.
Model-centric: focuses on optimizing the model; it is the main focus of academic research.
Model Training:
Model Analysis Metrics:
Aggregate metrics: assess performance across the entire dataset.
Sliced metrics: assess performance at a granular level on individual data subsets.
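A sketch of the contrast, assuming a simple classification setup with an illustrative `group` slicing column: a model can look fine in aggregate while failing badly on one slice:

```python
def accuracy(pairs):
    """Fraction of (label, prediction) pairs that match."""
    return sum(y == p for y, p in pairs) / len(pairs)

def sliced_accuracy(examples, slice_key):
    """Accuracy per subgroup defined by `slice_key`."""
    slices = {}
    for ex in examples:
        slices.setdefault(ex[slice_key], []).append((ex["label"], ex["pred"]))
    return {k: accuracy(v) for k, v in slices.items()}

examples = [
    {"group": "A", "label": 1, "pred": 1},
    {"group": "A", "label": 0, "pred": 0},
    {"group": "A", "label": 1, "pred": 1},
    {"group": "B", "label": 1, "pred": 0},   # the aggregate hides this failure
]
```

Here the aggregate accuracy is 0.75, yet slice B is completely wrong, which is exactly what slicing is meant to surface.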
Model Robustness: A model is considered robust if its results are consistently accurate, even if one or more features change relatively drastically.
Model Debugging: finding and fixing problems in models and improving model robustness.
Benchmarking models
Sensitivity analysis
Residual analysis
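Sensitivity analysis, for instance, can be sketched as perturbing one feature and measuring how much the prediction moves; the toy model and feature names below are hypothetical:

```python
def sensitivity(model, example, feature, delta=1.0):
    """Perturb one feature by `delta` and report the change in prediction."""
    base = model(example)
    perturbed = dict(example, **{feature: example[feature] + delta})
    return model(perturbed) - base

# Hypothetical toy model for illustration: prediction = 3*x + y
toy_model = lambda e: 3 * e["x"] + e["y"]
```

A robust model's output should not swing wildly under small, realistic perturbations of a single feature.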
Model Updating: incorporating new data into the model. This is done in one of two ways.
Batch Learning: Data is processed regularly, often on a schedule. It is simple and easy, but the delay may not be acceptable in some applications.
Online Learning: New data is processed immediately and the model updated. It is more challenging, as the model must always be up and running.
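A minimal sketch of the online case, assuming squared-error loss on a 1-D linear model: each arriving example triggers a single SGD step, with no retraining from scratch (the model and learning rate are illustrative):

```python
def online_update(w, b, x, y, lr=0.1):
    """One SGD step on squared error for y_hat = w*x + b as an example arrives."""
    err = (w * x + b) - y
    return w - lr * err * x, b - lr * err

# Simulate a stream generated by y = 2x + 1
w, b = 0.0, 0.0
stream = [(1, 3), (2, 5), (0, 1)] * 200
for x, y in stream:
    w, b = online_update(w, b, x, y)
```

The same update rule would run continuously in a live service, which is what makes keeping the model "always up and running" the hard part.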