Machine Learning Development Lifecycle (MLDLC)

Process:

  1. Project Initiation
    1. Define business problem to be solved
    2. Design architecture and choose technologies
    3. Derive ML problem from business problem
    4. Understand, collect, and explore data needed to solve the problem
  2. Scoping
    1. Define the problem: identify real-world problems
    2. Brainstorm AI solutions
    3. Assess feasibility of potential solutions:
      • Evaluate human level performance
      • Evaluate competitor or benchmark results
      • Explore available data and features for prediction
    4. Determine milestones
    5. Budget for resources
  3. Data Collection and Storage: often a data pipeline is used to gather and prepare data for machine learning tasks. Data size and type can impact both model development and operation.
    • Data Mining
    • Data Preparation
    • Feature Selection
    • Data Storage & Data Journey
      • Data Journey: the process of change in data as it moves through the pipeline. Tracking it is necessary to recreate, compare, or explain ML models.
      • Data Provenance (Data Lineage): the tracking of the series of transformations in the evolution of data and models, from raw input to output artifacts.
      • Data Versioning: data only reflects a snapshot of the world at the time it was gathered and is expected to change over time, so it is vital to version data along with the code and runtime parameters we typically track (see the sketch after this list).
      • Feature Stores (Feature Repositories): a central repository for documented, curated, and access-controlled data features that teams can share, discover, and use for model training and serving. It reduces redundancy, provides a unified, consistent, and persistent means of managing features that is performant and scalable, and helps avoid training-serving skew.
      • Data Storage:
        • A database is an organized collection of data that allows easy access and retrieval.
        • A data warehouse is a central repository of information designed for analysis to drive informed decisions.
        • A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.
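
A minimal sketch of hash-based data versioning (the file name and JSON registry are assumptions for illustration; production teams often use a dedicated tool such as DVC):

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def snapshot_version(data_path: str, registry_path: str = "data_versions.json") -> str:
    """Record a content hash of a dataset file so a training run can be
    traced back to the exact data snapshot it used."""
    digest = hashlib.sha256(Path(data_path).read_bytes()).hexdigest()
    registry_file = Path(registry_path)
    registry = json.loads(registry_file.read_text()) if registry_file.exists() else []
    registry.append({
        "path": data_path,
        "sha256": digest,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    })
    registry_file.write_text(json.dumps(registry, indent=2))
    return digest

# Usage: log the returned hash alongside the model run, e.g.
# version = snapshot_version("train.csv")  # "train.csv" is a hypothetical file
```
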
  4. Model Processing: building the model (creation, training, evaluation, and testing) while managing resources for optimal model performance.
  5. Production Process: Serving, maintaining, monitoring and debugging the model.
    1. Machine Learning Pipeline including:
      • Automation: it can be at several levels: manual, partial automation with a human in the loop, or full automation
      • Model Analysis: After training and deploying a model, the next phase is to evaluate its performance.
    2. Model Training
    3. DevOps Monitoring: monitoring resource consumption, e.g., CPU, GPU, memory, network requests, and data transfer.
    4. Continuous Evaluation and Monitoring: It is essential to continuously monitor data and model performance by performing Model Evaluation to get early warnings.
      • Model Monitoring: define several metrics that can signal that something has gone wrong with the model.
        • Input metrics: Null value input, wrong type, out of range input, anomaly, etc.
        • Output metrics: Null output, low prediction confidence on new data, etc.
        • Software metrics: Latency, server load, etc.
      • Problems:
        • Concept drift: the relationship between inputs and the target changes over time
        • Concept emergence: new classes or patterns appear that were absent from the training data
        • Covariate shift: the input distribution changes while the input-target relationship stays the same
        • Prior probability shift: the distribution of the target labels changes
      • Solutions:
        • Supervised techniques
          • Statistical process control
          • Sequential analysis (e.g., Linear Four Rates)
          • Error distribution monitoring (adaptive windowing)
        • Unsupervised techniques
          • Clustering/novelty detection (e.g., OLINDDA, MINAS)
          • Feature distribution monitoring (see the sketch after this list)
          • Model-dependent monitoring (e.g., Margin Density Drift Detection)
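
A minimal sketch of unsupervised feature distribution monitoring: a two-sample Kolmogorov-Smirnov test (via scipy) flags covariate shift for a single numeric feature; the significance level is an assumed choice:

```python
import numpy as np
from scipy.stats import ks_2samp

def covariate_shift_detected(reference: np.ndarray, current: np.ndarray,
                             alpha: float = 0.01) -> bool:
    """Compare a feature's serving distribution against the training
    (reference) distribution; a small p-value suggests drift."""
    _statistic, p_value = ks_2samp(reference, current)
    return p_value < alpha

# Synthetic check: a mean-shifted sample should be flagged.
rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
serving_feature = rng.normal(loc=0.5, scale=1.0, size=5_000)
print(covariate_shift_detected(train_feature, serving_feature))  # True
```
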

Concerns:

  • High-Performance Modeling:
    • Distributed Training:
      • Data parallelism replicates the model onto different accelerators (GPUs or TPUs) and splits the data between them (see the sketch below).
      • Model parallelism divides a large model (too big to fit on a single device) into partitions and assigns them to various accelerators.
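
A minimal sketch of data parallelism, assuming TensorFlow's MirroredStrategy on a single multi-GPU machine (model-parallel partitioning, by contrast, usually needs framework-specific sharding APIs):

```python
import tensorflow as tf

# MirroredStrategy replicates the model onto every visible GPU and splits
# each global batch across the replicas, averaging gradients in sync.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():  # variables created here are mirrored on all devices
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(dataset) then trains with each batch split between accelerators.
```
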
    • High-Performance Ingestion: Accelerators (GPU/TPU) are vital for high-performance modeling, but they are expensive and must be used efficiently. This efficiency is maintained by supplying accelerators with data fast enough that they do not sit idle, which improves training time. Approaches (sketched after this list):
      • Prefetching
      • Caching
      • Memory reduction
      • Parallelization of data extraction and transformation
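
A minimal sketch of these approaches using TensorFlow's tf.data (the TFRecord schema and file pattern are hypothetical):

```python
import tensorflow as tf

def parse_example(record):
    # Hypothetical schema: adapt the feature spec to your TFRecords.
    spec = {"x": tf.io.FixedLenFeature([20], tf.float32),
            "y": tf.io.FixedLenFeature([1], tf.float32)}
    parsed = tf.io.parse_single_example(record, spec)
    return parsed["x"], parsed["y"]

def make_pipeline(file_pattern: str) -> tf.data.Dataset:
    """Overlap extraction and transformation with training so the
    accelerator is never left waiting for data."""
    files = tf.data.Dataset.list_files(file_pattern)
    dataset = files.interleave(tf.data.TFRecordDataset,      # parallel extraction
                               num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.map(parse_example,                     # parallel transformation
                          num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.cache()   # keep parsed examples in memory after epoch 1
    dataset = dataset.batch(256)
    return dataset.prefetch(tf.data.AUTOTUNE)  # prepare batches ahead of the model
```
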
    • Knowledge Distillation: the idea behind knowledge distillation is to create a simple 'student' model that learns from a more complex 'teacher' model. The goal is to reproduce the performance of the complex model in a simpler, more efficient one (a minimal loss computation is sketched below).
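
A minimal NumPy sketch of a distillation loss (the temperature and weighting alpha are assumed hyperparameters):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of cross-entropy with the hard labels and KL divergence
    between temperature-softened teacher and student distributions."""
    p_student = softmax(student_logits, temperature)
    p_teacher = softmax(teacher_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=1)
    hard_ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels])
    # T^2 keeps the soft-target gradients on the same scale as the hard loss
    return np.mean(alpha * hard_ce + (1 - alpha) * temperature**2 * kl)
```
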
  • Interpretability and Explainability: eXplainable Artificial Intelligence (XAI)
  • Focus of modeling:
    • Data-centric: focuses on high-quality data; more practical in business applications.
    • Model-centric: focuses on optimizing the model; the main focus of academia.
  • Model Training:
    • Model Analysis Metrics:
      • Aggregate metrics: assess performance across the entire dataset.
      • Sliced metrics: assess performance at a granular level on individual data subsets (see the sketch below).
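
A minimal pandas sketch contrasting the two (the 'region' slice and toy labels are made up for illustration):

```python
import pandas as pd

results = pd.DataFrame({
    "region":    ["EU", "EU", "US", "US", "US", "APAC"],
    "label":     [1, 0, 1, 1, 0, 1],
    "predicted": [1, 0, 0, 0, 0, 1],
})
results["correct"] = results["label"] == results["predicted"]

# Aggregate metric: one accuracy number for the whole evaluation set.
print("Aggregate accuracy:", results["correct"].mean())

# Sliced metric: the same accuracy per subgroup, which can expose a model
# that looks fine on average but fails badly on one slice (here, US).
print(results.groupby("region")["correct"].mean())
```
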
    • Model Robustness: A model is considered robust if its results are consistently accurate, even if one or more features change relatively drastically.
    • Model Debugging: finding and fixing problems in models and improving model robustness.
      • Benchmarking models
      • Sensitivity analysis (see the sketch after this list)
      • Residual analysis
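
A minimal sketch of sensitivity analysis (the predict callable, feature index, and perturbation size are assumptions):

```python
import numpy as np

def sensitivity(predict, X: np.ndarray, feature_index: int,
                delta: float = 0.1) -> float:
    """Perturb one feature by a fraction of its standard deviation and
    measure the mean absolute change in predictions; a robust model should
    not swing wildly under small, plausible perturbations."""
    X_perturbed = X.copy()
    X_perturbed[:, feature_index] += delta * X[:, feature_index].std()
    return float(np.mean(np.abs(predict(X_perturbed) - predict(X))))

# Usage with a hypothetical fitted scikit-learn regressor:
# impact = sensitivity(model.predict, X_test, feature_index=3)
```
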
    • Model Updating: incorporating new data into the model, done in one of two ways (contrasted in the sketch after this list).
      • Batch Learning: data is processed at regular intervals, often on a schedule. It is simple and easy, but the delay may not be acceptable in some applications.
      • Online Learning: new data is processed immediately and the model is updated. It is challenging because the model must always be up and running.
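
A minimal sketch contrasting the two, using scikit-learn's SGDClassifier.partial_fit for the online case (the synthetic stream stands in for real arriving data):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Batch learning would periodically call fit() on the accumulated dataset;
# online learning instead updates the live model as each mini-batch arrives.
model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])  # all classes must be declared on the first call

rng = np.random.default_rng(0)
for _step in range(100):                       # stand-in for a data stream
    X_batch = rng.normal(size=(32, 5))
    y_batch = (X_batch[:, 0] > 0).astype(int)  # toy target
    model.partial_fit(X_batch, y_batch, classes=classes)

print(model.predict(rng.normal(size=(3, 5))))
```
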
  • Life cycle management
    • Systems degrade over time
    • Models fail over time
    • Experiment tracking
    • Data drifts over time
    • Model Deployment CLI and APIs
    • Continuous redeployment of new models