Machine Learning Project Structure

In any project, structure plays an important role in rapid development, readability, reproducibility, and empowering creative work while enforcing good workflow practices. Machine learning projects can be structured in different ways depending on their scope, size, and specific requirements; however, such structures share certain commonalities.
Here, I will introduce my starter project-structure template and the workflow for extending it, so that the files and directories of a machine learning project can be organized in a fast and dependable manner:

Starter Machine Learning Project Template:

Most ML projects start with the following files and directories. Each one has a specific role in the project and may be extended later, depending on the project's development journey.

.
├── data
│   ├── raw
│   │   └── .
│   ├── processed
│   │   └── .
│   └── temp
│       └── .
├── models
│   └── .
├── notebooks
│   └── .
├── README.md
├── requirements.txt
└── .gitignore
  • data: Contains directories for raw (unprocessed) data, processed data (the output of the data preprocessing step), and temporary or intermediate data generated during preprocessing.
  • models: Stores the ML model files generated by the training process.
  • notebooks: Contains the experiment and development notebooks created during the development process.
  • README.md: A brief introduction to the project, its goals, its tools, and notes about the project.
  • requirements.txt: Stores the project's dependencies.
  • .gitignore: Specifies which files and directories should be ignored by the version control system; for example, the virtual environment directory should be listed here, as in the sketch below.
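A minimal .gitignore for this layout might look like the following (which directories to exclude, such as data and models, depends on your team's versioning policy):

# virtual environment
venv/

# Python cache and notebook checkpoints
__pycache__/
.ipynb_checkpoints/

# large artifacts often kept out of version control
data/
models/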

Extending Project Structure

As you work on your project, following your own workflow or the general Machine Learning Workflow, you will add more files to this structure.

Define Goals and Process

The first step in a machine learning project is to define the problem that should be solved and the project's goals. This is reflected in README.md, and depending on the developer's documentation process, they might want to add the following directory and files to the project:

├── documentation
│   ├── problem-definition.md
│   ├── goals.md
│   ├── data-dictionary.md
│   └── data-notes.md

More files can be added for other processes, such as data-gathering.md and data-cleaning.md, or for documenting results, such as baseline.md and benchmark.md.

Collecting Data

Collecting data is the next step in the Machine Learning Workflow. If the data is not static and simply copy-pasted into the data/raw/ directory, then depending on the data source, one of these files should be added:

├── notebooks
│   ├── collect.ipynb

or:

├── notebooks
│   ├── scrape.ipynb
  • collect.ipynb: Downloads, transfers, or uses data-mining methods to gather raw data, as in the sketch below.
  • scrape.ipynb: Code for web scraping to acquire data from websites and other online resources.
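As an illustration, a cell in collect.ipynb might download a CSV into data/raw/ like this (the URL and filename are placeholders, not a real data source):

# collect.ipynb -- fetch raw data into data/raw/
# (the URL and filename below are hypothetical placeholders)
from pathlib import Path
import urllib.request

RAW_DIR = Path("data/raw")
RAW_DIR.mkdir(parents=True, exist_ok=True)

url = "https://example.com/dataset.csv"  # placeholder data source
urllib.request.urlretrieve(url, RAW_DIR / "dataset.csv")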

Baseline

A baseline serves as a reference point against which the model's results are compared; the training process then attempts to beat it. Therefore, the baseline should be created before the model:

├── notebooks
│   ├── baseline.ipynb
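For a classification task, baseline.ipynb might be as simple as a most-frequent-class predictor; this sketch uses scikit-learn's DummyClassifier, with hypothetical file and column names:

# baseline.ipynb -- most-frequent-class baseline
# (file path and the "target" column are hypothetical)
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/processed/dataset.csv")
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print("baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))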

Exploratory Data Analysis (EDA)

Depending on the data size and the workload required for EDA, either a single eda.ipynb file is used or a structure similar to this is created:

├── notebooks
│   ├── eda
│   │   ├── descriptive-analysis.ipynb
│   │   ├── features-extraction.ipynb
│   │   ├── outlier-analysis.ipynb
│   │   └── var-analysis.ipynb
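For example, descriptive-analysis.ipynb might start with summary statistics and missing-value counts (the file path is a placeholder):

# descriptive-analysis.ipynb -- first-pass descriptive statistics
# (file path is a placeholder)
import pandas as pd

df = pd.read_csv("data/raw/dataset.csv")
df.info()               # column types and non-null counts
print(df.describe())    # summary statistics for numeric columns
print(df.isna().sum())  # missing values per column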

Data Preparation

Similar to EDA, either a single data-preparation.ipynb file may be used to handle this task, or a structure similar to this is created:

├── notebooks
│   ├── data-preparation
│   │   ├── cleaning.ipynb
│   │   ├── transformation.ipynb
│   │   └── data-reduction.ipynb
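A minimal cleaning.ipynb pass might look like this, writing its output to data/processed/ (the paths, and the choice to drop rows rather than impute, are illustrative):

# cleaning.ipynb -- a minimal cleaning pass
# (paths are placeholders; dropping rows is one of several options)
import pandas as pd

df = pd.read_csv("data/raw/dataset.csv")
df = df.drop_duplicates()
df = df.dropna()  # or impute, depending on the data
df.to_csv("data/processed/clean.csv", index=False)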

Training ML Model

To train the model, it may be best, depending on the project's scope, size, required level of interpretability, and resources such as time and processing power, to experiment with multiple training methods. So, starting with the following structure, different algorithms can be used for experimentation and evaluation:

├── notebooks
│   ├── experiments
│   │   ├── training-baseline.ipynb
│   │   ├── training-<method>.ipynb
│   │   ├── error-analysis-<method>.ipynb
│   │   ├── evaluate-<method>.ipynb
│   │   └── evaluate.ipynb

Each training-<method>.ipynb in this structure may come with an error-analysis-<method>.ipynb file for performing error analysis and monitoring on that algorithm's results.
Also, to evaluate each method, its own evaluation file should be added; alternatively, a single evaluate.ipynb file can be used for benchmarking all models. A sketch of one such experiment notebook follows.
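Here is what a training-<method>.ipynb might look like for a random forest, saving the fitted model into models/ (the method, paths, columns, and hyperparameters are illustrative assumptions, not prescribed choices):

# training-random-forest.ipynb -- one experiment among several
# (method, paths, and parameters are illustrative, not prescribed)
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/processed/clean.csv")
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))

joblib.dump(model, "models/random-forest.joblib")  # keep for evaluation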
Furthermore, other modules or scripts might be required, which can be added to a scripts directory:

.
├── scripts
│   └── module
│       └── __init__.py
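Such a module typically collects helpers shared across notebooks; as a hypothetical example, a preprocessing helper inside scripts/module/ might look like this:

# scripts/module/preprocessing.py -- a hypothetical shared helper
import pandas as pd

def load_clean(path: str = "data/processed/clean.csv") -> pd.DataFrame:
    """Load the cleaned dataset used across the experiment notebooks."""
    return pd.read_csv(path)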

After desirable results are achieved, training and evaluation files can be added to the root directory of our project for deployment:

.
├── train.py
├── validation.py

These two files contain the preferred training and evaluation code, chosen based on the experiments' results.
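As a sketch, train.py might be a script version of the winning experiment, retrained on the full dataset (the method and the scripts.module helper are the hypothetical examples from above):

# train.py -- script version of the chosen experiment (a sketch)
import joblib
from sklearn.ensemble import RandomForestClassifier

from scripts.module.preprocessing import load_clean  # hypothetical helper

df = load_clean()
X, y = df.drop(columns=["target"]), df["target"]

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)
joblib.dump(model, "models/model.joblib")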

Serving Model

Finally, to serve the model to end users or as a part of a larger project, you must provide the right interface.
Deploying ML models as web services can be done simply by providing a REST API, which means adding the following files:

.
├── load.py
├── api.py
  • load.py or persist.py: Loads the model into memory and keeps it persistent.
  • api.py or serve.py: Provides the web service in the production stage and handles routing and serving requests through the REST API, as in the sketch below.
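A minimal api.py sketch using Flask might look like this (the framework choice, route, and payload format are assumptions; FastAPI or similar would work equally well):

# api.py -- minimal REST API sketch with Flask
# (framework, route, and payload format are assumptions)
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("models/model.joblib")  # load once, keep in memory

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]  # e.g. a list of feature values
    prediction = model.predict([features])[0]
    return jsonify({"prediction": str(prediction)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)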

Deploying

To deploy your project, you will need to follow one of the common setup and packaging practices. Depending on your preferences, you can add one or both of these files:

.
├── pyproject.toml
├── setup.py
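For instance, a minimal setup.py might look like the sketch below (the name and version are placeholders; newer projects often declare the same metadata in pyproject.toml instead):

# setup.py -- minimal packaging sketch (name and version are placeholders)
from setuptools import setup, find_packages

setup(
    name="my-ml-project",
    version="0.1.0",
    packages=find_packages(),
    # naive: assumes one requirement per line in requirements.txt
    install_requires=open("requirements.txt").read().splitlines(),
)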

Furthermore, depending on the CI/CD (Continuous Integration & Continuous Delivery) method or other practices chosen by the DevOps team, other files might be added here.

Conclusion

This structure is written with reproducibility in mind and adheres to the best practices of the Machine Learning Workflow. However, depending on the MLOps practices and tools used in your project's life cycle, some modifications may be necessary. Furthermore, this structure can also be broken down across the different stages from development to production. For instance, the production stage does not require directories such as data and notebooks, or files such as train.py, since they are only required for creating the ML model and have no part in serving it.
