In any project, structure plays an important role in rapid development, readability, reproducibility, and empowering creative work while enforcing good workflow practices. Machine learning projects can be structured in different ways depending on their scope, size, and specific requirements, but such structures share certain commonalities.
Here, I will introduce my project structure starter template and the workflow for extending it into a structure that can be used to organize files and directories in a machine learning project in a fast and dependable manner:
Starter Machine Learning Project Template:
Most ML projects start with the following files and directories. Each one has a specific role in the project and may later be extended, depending on the project's development journey.
.
├── data
│   ├── raw
│   │   ├── .
│   ├── processed
│   │   ├── .
│   ├── temp
│   │   ├── .
├── models
│   ├── .
├── notebooks
│   ├── .
├── README.md
├── requirements.txt
├── .gitignore
data: This directory contains subdirectories for raw (unprocessed) data, processed data (the output of the data preprocessing step), and temporary and intermediate data (generated during preprocessing).
models: This directory stores the ML model files generated by the training process.
notebooks: This directory contains the experiments and notebooks created during the development process.
README.md: A brief introduction to the project, its goals, tools, and notes about the project.
requirements.txt: Stores the project's dependencies.
.gitignore: Specifies which files and directories should be ignored by the version control system. For example, the virtual environment directory should be listed here so that it is ignored.
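For instance, a minimal .gitignore for this template might look like the following (ignoring data/ and models/ is an assumption that large artifacts are tracked outside Git, e.g. with a data versioning tool):

```
# virtual environment
venv/
.venv/

# Python bytecode
__pycache__/
*.pyc

# notebook checkpoints
.ipynb_checkpoints/

# large or sensitive artifacts, versioned outside Git
data/
models/
```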
Extending Project Structure
As you work on your project, following either your own workflow or the general Machine Learning workflow, you will add more files to this structure.
Define Goals and Process
The first step in a Machine Learning project is to define the problem to be solved and the project's goals. This is reflected in README.md, and depending on their documentation process, developers might want to add the following files to the project:
│ ├── problem-definition.md
│ ├── goals.md
│ ├── data-dictionary.md
│ ├── data-notes.md
More files can be added for other processes, such as data-cleaning.md, or for documenting results.
Collecting data is the next step in the Machine Learning workflow. If the data is not static and simply copied into the data/raw/ directory, then depending on the data source, one of these files should be added:
├── notebooks
│   ├── collect.ipynb

├── notebooks
│   ├── scrap.ipynb
collect.ipynb: Downloads, transfers, or uses data mining methods to gather raw data.
scrap.ipynb: Code for web scraping to acquire data from websites and online resources.
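For example, collect.ipynb might contain a small download helper like this sketch; the URL, filename, and helper names are hypothetical:

```python
from pathlib import Path
import urllib.request

RAW_DIR = Path("data/raw")  # matches the data/raw/ directory in the template

def save_raw(content: bytes, filename: str) -> Path:
    """Write collected bytes under data/raw/, creating the directory if needed."""
    RAW_DIR.mkdir(parents=True, exist_ok=True)
    dest = RAW_DIR / filename
    dest.write_bytes(content)
    return dest

def collect(url: str, filename: str) -> Path:
    """Fetch a remote dataset (hypothetical URL) and store it as raw data."""
    with urllib.request.urlopen(url) as resp:
        return save_raw(resp.read(), filename)
```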
A baseline serves as a reference point for the model's results; during training, we attempt to beat it. Therefore, before creating the model, a baseline should be established:
├── notebooks
│   ├── baseline.ipynb
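For a classification task, baseline.ipynb could be as simple as a majority-class predictor, sketched here in plain Python:

```python
from collections import Counter

def majority_baseline(y_train, y_test):
    """Predict the most frequent training label for every test example
    and return the predictions and the resulting accuracy."""
    majority = Counter(y_train).most_common(1)[0][0]
    preds = [majority] * len(y_test)
    accuracy = sum(p == t for p, t in zip(preds, y_test)) / len(y_test)
    return preds, accuracy
```

Any trained model should at least beat this score to be worth keeping.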
Exploratory Data Analysis (EDA)
Depending on the data size and the workload required for EDA, either a single
eda.ipynb file is used or a structure similar to this is created:
├── notebooks
│   ├── eda
│   │   ├── descriptive-analysis.ipynb
│   │   ├── features-extraction.ipynb
│   │   ├── outlier-analysis.ipynb
│   │   ├── var-analysis.ipynb
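For illustration, descriptive-analysis.ipynb might compute simple per-column statistics; this standard-library sketch assumes a plain list of numeric values as input:

```python
import statistics

def describe(values):
    """Minimal descriptive statistics for one numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }
```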
Data Preparation

Similar to EDA, either a single data-preparation.ipynb file may be used to handle this task, or a structure similar to this is created:
├── notebooks
│   ├── data-preparation
│   │   ├── cleaning.ipynb
│   │   ├── transformation.ipynb
│   │   ├── data-reduction.ipynb
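As a sketch, cleaning.ipynb might start with something like this; the list-of-dicts row representation is just an assumption for illustration:

```python
def clean_rows(rows):
    """Drop rows with missing values and exact duplicates."""
    seen, cleaned = set(), []
    for row in rows:
        if any(v is None for v in row.values()):
            continue  # drop incomplete rows
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # drop duplicates
        seen.add(key)
        cleaned.append(row)
    return cleaned
```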
Training ML Model
To train the model, depending on the scope, size, required level of interpretability, and resources such as time and processing power, it may be best to experiment with multiple training methods. Starting with the following structure, different algorithms can be used for experimentation and evaluation:
├── notebooks
│   ├── experiments
│   │   ├── training-baseline.ipynb
│   │   ├── training-<method>.ipynb
│   │   ├── error-analysis-<method>.ipynb
│   │   ├── evaluate-<method>.ipynb
│   │   ├── evaluate.ipynb
Each training file in this structure may come with an error-analysis-<method>.ipynb file for performing error analysis and monitoring of that algorithm's results.
Also, to evaluate each method, its own evaluation file (evaluate-<method>.ipynb) should be added. Alternatively, a single evaluate.ipynb file can be used for benchmarking all models.
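A sketch of the benchmarking step in evaluate.ipynb; the {name: accuracy} results dict is a hypothetical stand-in for whatever metrics the experiments produce:

```python
def benchmark(results):
    """Rank experiment results (model name -> accuracy) from best to worst."""
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)
```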
Furthermore, other modules or scripts might be required, which can be added as:

.
├── scripts
│   ├── module
│   │   ├── __init__.py
After desirable results are achieved, training and evaluation files can be added to the root directory of our project for deployment:
.
├── train.py
├── validation.py
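A minimal sketch of what train.py might look like; the "model" here is a placeholder dict, since the real algorithm depends on the experiments' outcome:

```python
import pickle
from pathlib import Path

MODELS_DIR = Path("models")  # the models/ directory from the template

def train(X, y):
    """Placeholder 'model' that just stores the mean target value.
    A real train.py would fit the algorithm chosen from the experiments."""
    return {"mean_target": sum(y) / len(y)}

def save_model(model, name="model.pkl"):
    """Persist the trained model into the models/ directory."""
    MODELS_DIR.mkdir(exist_ok=True)
    path = MODELS_DIR / name
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path
```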
These two files contain the preferred training and validation code, chosen based on the experiments' results.
Finally, to serve the model to end users or as part of a larger project, you must provide the right interface.
Deploying ML models as web services can simply be done by providing a REST API, which requires adding the following files:
.
├── load.py
├── api.py
load.py: This file loads the model into memory and makes it persistent.
api.py: This file provides a web service in the production stage and handles routing and serving of tasks as a REST API.
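As a standard-library sketch of api.py (a real project might use Flask or FastAPI instead), with predict() standing in for the model loaded by load.py:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    """Toy scoring rule (sum of features) — a placeholder for calling
    the real model loaded by load.py."""
    return {"prediction": sum(features)}

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Route: POST /predict with a JSON body {"features": [...]}
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(predict(payload["features"])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

def run(port=8000):
    """Start the API server (call this from your deployment entry point)."""
    HTTPServer(("", port), Handler).serve_forever()
```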
To deploy your project, you will need to follow one of the standard packaging and deployment practices. Depending on your preferences, you can add one or both of these files:
.
├── pyproject.toml
├── setup.py
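A minimal pyproject.toml sketch; the project name, version, and dependencies are placeholders to adapt:

```toml
[project]
name = "my-ml-project"   # placeholder name
version = "0.1.0"
dependencies = [
    # mirror requirements.txt here, e.g.:
    # "scikit-learn>=1.3",
]

[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"
```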
Furthermore, depending on the CI/CD (Continuous Integration & Continuous Delivery) method or other practices chosen by the DevOps team, other files might be added here.
This structure is written with reproducibility in mind and adheres to the best practices of the Machine Learning workflow. However, depending on the MLOps practices and tools used in your project's life-cycle, some modifications may be necessary. Furthermore, this structure can be broken down for the different stages from development to production. For instance, the production stage does not require directories such as notebooks, or files such as train.py, since they are only needed for creating the ML model and play no part in serving it.