Data Pipeline

A data pipeline is the technical infrastructure used to automate, monitor, and govern data-related tasks such as Data Mining and Data Preparation; it is responsible for passing the pipeline's output to downstream analysis or Machine Learning modules.

Process of creating data pipelines:

  1. Identify Data Nodes. Data Nodes represent any data that should be extracted, loaded, manipulated (the Extract, Transform, Load (ETL) process), and finally saved.
  2. Create Tasks: functions that interact with the data.
  3. Combine Data Nodes with Tasks to perform Data Preparation (see the sketch after this list).
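
A minimal, framework-agnostic sketch of these three steps in plain Python (the node names and the extract/transform/load functions are hypothetical illustrations, not any specific library's API):

    # Data nodes live in a shared dict; tasks are plain functions that
    # read from and write to it; the runner combines the two.

    def extract(nodes):
        # Step 1 (Extract): pull raw records from a source; here, an in-memory list.
        nodes["raw"] = [{"value": 1}, {"value": None}, {"value": 3}]

    def transform(nodes):
        # Step 2 (Transform): drop records with missing values.
        nodes["clean"] = [r for r in nodes["raw"] if r["value"] is not None]

    def load(nodes):
        # Step 3 (Save): persist the prepared data; here, just print it.
        print("saved:", nodes["clean"])

    def run_pipeline(tasks):
        # Combine Data Nodes with Tasks: run each task against the shared nodes.
        nodes = {}
        for task in tasks:
            task(nodes)
        return nodes

    run_pipeline([extract, transform, load])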

Concerns:

  • Reliability: failed tasks should be detected and retried or recovered without corrupting data.
  • Scalability: the pipeline should handle growing data volumes and numbers of tasks.
  • Efficiency: tasks should avoid redundant computation and unnecessary data movement.

Notes:

  • taipy is a Python library and VSCode plugin for creating and running data pipelines; it aims to turn Data and AI algorithms into production-ready web applications in no time (see the first sketch below).
  • Pipeline tasks that do not depend on one another's outputs can often be run simultaneously to improve performance (see the second sketch below).
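
A hedged sketch of a one-task pipeline using Taipy's Config API (configure_data_node, configure_task, configure_scenario); signatures and required setup vary across Taipy versions, so treat this as illustrative rather than definitive:

    import taipy as tp
    from taipy import Config

    def clean(raw):
        # Task function: drop missing values from the input data node.
        return [r for r in raw if r is not None]

    # Data nodes: one holds the raw input, one receives the task's output.
    raw_cfg = Config.configure_data_node("raw_data", default_data=[1, None, 3])
    clean_cfg = Config.configure_data_node("clean_data")

    # Task: wires the function to its input and output data nodes.
    task_cfg = Config.configure_task("clean_task", clean, input=raw_cfg, output=clean_cfg)

    # Scenario: combines the data nodes and tasks into a runnable pipeline.
    scenario_cfg = Config.configure_scenario("cleaning", task_configs=[task_cfg])

    if __name__ == "__main__":
        tp.Core().run()  # start Taipy's orchestration service (recent versions)
        scenario = tp.create_scenario(scenario_cfg)
        tp.submit(scenario)
        print(scenario.clean_data.read())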
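
A minimal sketch of running independent tasks simultaneously with Python's standard library (fetch_sales and fetch_inventory are hypothetical placeholder tasks):

    from concurrent.futures import ThreadPoolExecutor

    def fetch_sales():
        # Hypothetical extraction task.
        return [100, 200, 300]

    def fetch_inventory():
        # Independent task: shares no inputs or outputs with fetch_sales.
        return {"widgets": 42}

    # Because the tasks are independent, they can safely run concurrently.
    with ThreadPoolExecutor() as pool:
        sales_future = pool.submit(fetch_sales)
        inventory_future = pool.submit(fetch_inventory)
        sales, inventory = sales_future.result(), inventory_future.result()

    print(sales, inventory)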