Home

BigNerd edited this page Mar 16, 2020
justmltools suggests and supports the following path layout for organizing your machine learning data in any local file system, in AWS S3 buckets and/or in MLflow experiment runs:
```
<prefix>
├── input
│   ├── config
│   │   ├── <your_config_file_1>
│   │   ├── <your_config_file_2>
│   │   └── ...
│   └── data
│       ├── <your_data_file_1>
│       ├── <your_data_file_2>
│       └── ...
├── model
│   ├── <your_model_file_1>
│   ├── <your_model_file_2>
│   └── ...
└── output
    ├── <your_output_file_1>
    ├── <your_output_file_2>
    └── ...
```
In a local file system, the prefix can be any root path, for example `/opt/ml/` or `C:\projects\my_project\data`. In an S3 bucket, the prefix can be any key prefix, for example `/projects/my_project`.
In an MLflow experiment run, artifacts can be stored in the same layout, but require no prefix.
Beneath the suggested standard paths `input/config/`, `input/data/`, `model/`, and `output/` you can either store files directly or add further sub-paths to organize your files.
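As an illustration (this is not part of the justmltools API), the four standard locations could be derived from a prefix with `pathlib`; the prefix value below is a hypothetical example:

```python
from pathlib import PurePosixPath

# Hypothetical example prefix; any root path works the same way.
prefix = PurePosixPath("/opt/ml")

# The four standard locations suggested by the layout above.
config_dir = prefix / "input" / "config"
data_dir = prefix / "input" / "data"
model_dir = prefix / "model"
output_dir = prefix / "output"

print(config_dir)  # /opt/ml/input/config
print(output_dir)  # /opt/ml/output
```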
The following diagram shows a typical (slightly simplified) sequence of steps in a machine learning process:
- To start with, your experiment must obtain a configuration file that defines all variable information controlling the experiment run. This config file can reside in your local file system or in a shared repository; either way, the repo downloader makes it available to the experiment locally and returns the corresponding local file system path.
- The config file defines the reference(s) to the input data, which the experiment then requests from the same or another type of repo downloader in one or more steps. If any of the input data is compressed, the repo downloader recognizes this by the file name suffix `.zip` and unzips it before returning the local file system path to the experiment. If the data was already downloaded in a previous run on the same machine, the repo downloader detects this and skips the download.
- Once the experiment has requested and received all required input config and data, it can perform whatever is necessary to train its model.
- Finally, the experiment hands over all input config (including the input data references) and the local output artifact file paths to the tracker, which uploads them to the tracking repo.
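The steps above can be sketched with the standard library alone; `download_with_cache` and `LocalTracker` below are hypothetical stand-ins for justmltools' repo downloader and tracker, not its actual API:

```python
import json
import shutil
import tempfile
import zipfile
from pathlib import Path


def download_with_cache(fetch, remote_name: str, cache_dir: Path) -> Path:
    """Fetch a remote file into cache_dir unless it is already there;
    unzip .zip archives and return the local path (hypothetical sketch)."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    local = cache_dir / remote_name
    if not local.exists():  # skip the download on repeated runs
        fetch(remote_name, local)
    if local.suffix == ".zip":  # compression recognized by file name suffix
        target = local.with_suffix("")
        if not target.exists():
            with zipfile.ZipFile(local) as zf:
                zf.extractall(target)
        return target
    return local


class LocalTracker:
    """Hypothetical tracker that records config and output artifacts
    in a local tracking repo directory."""

    def __init__(self, repo_dir: Path):
        self.repo_dir = repo_dir
        self.repo_dir.mkdir(parents=True, exist_ok=True)

    def log_config(self, config: dict) -> None:
        (self.repo_dir / "config.json").write_text(json.dumps(config))

    def log_artifact(self, path: Path) -> None:
        shutil.copy(path, self.repo_dir / path.name)


# Usage sketch with a fake fetch function standing in for S3/MLflow access:
demo_root = Path(tempfile.mkdtemp())

def fake_fetch(name: str, dest: Path) -> None:
    dest.write_text('{"data_ref": "train.zip"}')

config_path = download_with_cache(
    fake_fetch, "config.json", demo_root / "input" / "config"
)
config = json.loads(config_path.read_text())
```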