Starter Template

Creating reproducible Machine Learning projects:


Starter project structure:

.
├── data
│   ├── raw
│   │   ├── .
│   ├── processed
│   │   ├── .
│   ├── temp
│   │   ├── .
├── models
│   ├── .
├── notebooks
│   ├── .
├── README.md
├── requirements.txt
├── .gitignore
  • data: This directory contains subdirectories for raw (unprocessed) data, processed data (the output of the data preprocessing step), and temporary or intermediate data generated during preprocessing.
  • models: The models directory stores the ML model files produced by the machine learning training process.
  • notebooks: The notebooks directory contains the experiment notebooks created during the development process.
  • README.md: A brief introduction to the project, its goals, tools, and notes about the project.
  • requirements.txt: Lists the project's dependencies.
  • .gitignore: Specifies which files and directories should be ignored by the version control system; for example, the virtual environment directory should be listed here, as shown in the sketch below.
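
As a quick illustration, the .gitignore for this layout might start with entries like these (the exact patterns are an assumption and should be adapted to your tooling):

# Illustrative .gitignore entries for this layout.
# Virtual environment and Python caches:
venv/
__pycache__/
.ipynb_checkpoints/
# Temporary data and large model artifacts:
data/temp/
models/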

Extending Project Structure

The project structure is extended by following the Data Science Lifecycle process.

Define Goals and Process

├── documentation
│   ├── problem-definition.md
│   ├── goals.md
│   ├── data-dictionary.md
│   ├── data-notes.md

Collecting Data

Depending on whether Data Mining is done by downloading datasets or by performing Web Mining:

├── notebooks
│   ├── collect.ipynb

or:

├── notebooks
│   ├── scrap.ipynb
  • collect.ipynb: Downloads, transfers, or uses Data Mining methods to gather raw data (a sketch follows this list).
  • scrap.ipynb: Code for web mining to acquire data from websites and online resources.
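
A minimal sketch of what collect.ipynb might contain, assuming the dataset is published at a single URL (the URL and file name below are placeholders):

# Data-collection sketch: download the raw dataset into data/raw.
from pathlib import Path
import requests

DATA_URL = "https://example.com/dataset.csv"  # placeholder URL
RAW_DIR = Path("data/raw")
RAW_DIR.mkdir(parents=True, exist_ok=True)

# Fetch the file and store it unchanged, so preprocessing stays reproducible.
response = requests.get(DATA_URL, timeout=30)
response.raise_for_status()
(RAW_DIR / "dataset.csv").write_bytes(response.content)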

Exploratory Data Analysis (EDA)

Depending on the data size and the workload required for Exploratory Data Analysis (EDA), either a single eda.ipynb file is used or a structure similar to this is created:

├── notebooks
│   ├── eda
│       ├── descriptive-analysis.ipynb
│       ├── features-extraction.ipynb
│       ├── outlier-analysis.ipynb
│       ├── var-analysis.ipynb
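
For example, descriptive-analysis.ipynb could open with a few standard checks (a sketch, assuming the raw data was saved as data/raw/dataset.csv):

# Descriptive-analysis sketch (the file name is an assumption).
import pandas as pd

df = pd.read_csv("data/raw/dataset.csv")

# Shape, types, and summary statistics of every column.
print(df.shape)
print(df.dtypes)
print(df.describe(include="all"))

# Missing values per column, as a quick data-quality check.
print(df.isna().sum())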

Data Preparation

Similar to EDA, the Data Preparation step can be handled in a single data-preparation.ipynb file, or it can use a structure similar to this:

├── notebooks
│   ├── data-preparation
│       ├── cleaning.ipynb
│       ├── transformation.ipynb
│       ├── data-reduction.ipynb
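
Under the same assumptions, cleaning.ipynb could perform a few typical steps before writing the result to data/processed:

# Cleaning sketch: raw data in, processed data out (paths are illustrative).
from pathlib import Path
import pandas as pd

df = pd.read_csv("data/raw/dataset.csv")

# Drop duplicate rows and fill missing numeric values with the column median.
df = df.drop_duplicates()
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

Path("data/processed").mkdir(parents=True, exist_ok=True)
df.to_csv("data/processed/dataset.csv", index=False)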

Training Machine Learning Model

Baseline

The baseline serves as a reference point against which the trained models' results are compared.

├── notebooks
│   ├── baseline.ipynb
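
For a classification task, baseline.ipynb can be as simple as a majority-class model (a sketch; the target column name is an assumption):

# Majority-class baseline sketch (the "target" column is an assumption).
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/processed/dataset.csv")
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Any trained model should beat this score to justify its added complexity.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print("Baseline accuracy:", baseline.score(X_test, y_test))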

Training

Experimentation and initial evaluations precede the training code that produces the final results:

├── notebooks
│   ├── experiments
│       ├── training-baseline.ipynb
│       ├── training-<method>.ipynb
│       ├── error-analysis-<method>.ipynb
│       ├── evaluate-<method>.ipynb
│       ├── evaluate.ipynb
  • Each training file in this structure may come with an error-analysis-<method>.ipynb file for performing error analysis and monitoring on that algorithm's results.
  • To evaluate each method, its own evaluate-<method>.ipynb file should be added. Alternatively, a single evaluate.ipynb file can be used to benchmark all models.
    Furthermore, other modules or scripts might be required; these can be added to the scripts directory (a sketch follows the listing below):
.
├── scripts
│   ├── module
│   │   ├── __init__.py
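
Such a module could, for instance, collect small helpers shared across the experiment notebooks (a sketch with a hypothetical function, not part of the template itself):

# scripts/module/__init__.py -- shared helpers for the experiment notebooks.
from sklearn.metrics import accuracy_score, f1_score

def summarize_predictions(y_true, y_pred):
    """Return the evaluation metrics reported for every experiment (hypothetical helper)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred, average="weighted"),
    }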

After desirable results are achieved, training and evaluation files can be added to the root directory of our project for deployment:

.
├── train.py
├── validation.py

These two files are chosen based on the preferred training code, as determined by the experiments' results.
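
A minimal train.py derived from the winning experiment might look roughly like this (the model choice, column name, and paths are assumptions):

# train.py -- minimal training entry point (model and paths are illustrative).
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def main():
    df = pd.read_csv("data/processed/dataset.csv")
    X, y = df.drop(columns=["target"]), df["target"]

    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X, y)

    # Persist the trained model into the models directory for serving.
    joblib.dump(model, "models/model.pkl")

if __name__ == "__main__":
    main()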

Serving Model

Finally, to serve the model to end users or as part of a larger project, a suitable interface must be provided. It can load a web app in the user's browser or expose an API.

.
├── load.py
├── api.py
  • load.py or persist.py: This file loads the model into memory and keeps it persistent.
  • api.py or serve.py: This file provides a web service in the production stage and handles routing and serving of requests as a REST API.
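
As one possible implementation, assuming FastAPI and a model persisted with joblib (both are choices, not requirements of the template), api.py might look like this:

# api.py -- minimal REST endpoint sketch (assumes FastAPI and a joblib-saved model).
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("models/model.pkl")  # load the persisted model once, at startup

class Features(BaseModel):
    values: list[float]  # one feature vector per request

@app.post("/predict")
def predict(features: Features):
    """Return the model's prediction for a single feature vector."""
    prediction = model.predict([features.values]).tolist()[0]
    return {"prediction": prediction}

Such a service can then be run locally with, for example, uvicorn api:app --reload.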

Deploying

To deploy your project, you will need to follow one of the standard setup and packaging practices. Depending on your preferences, you can add one or both of these files:

.
├── pyproject.toml
├── setup.py
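
For example, a minimal setup.py could be sketched as follows (the package name, version, and dependency list are placeholders):

# setup.py -- minimal packaging sketch (name and metadata are placeholders).
from setuptools import setup, find_packages

setup(
    name="ml-starter-project",
    version="0.1.0",
    packages=find_packages(),
    install_requires=["pandas", "scikit-learn"],  # or mirror requirements.txt
)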

Furthermore, depending on the CI/CD (Continuous Integration & Continuous Delivery) approach or other methods chosen by the DevOps team, additional files might be added here.


Resources: