This project generates comparable Machine Learning as a Service records from reviewed Hugging Face service manifests. Each manifest row defines an independent model, dataset, task, and training regime. The runner validates the row, trains or loads the model, evaluates it on a benchmark split, records functional attributes and system metrics, and writes the resulting service data to a SQLite database.
The active workflow is:
registry -> hf-manifest -> review manifest -> run-manifest --dry-run -> run-manifest -> SQLite service records
Each manifest row describes one independent service instance. Executing a row trains or loads one model, evaluates it on its benchmark split, records functional attributes and service metrics, then stores one service record in SQLite.
Python 3.12 is the recommended baseline for current Windows ROCm PyTorch environments. Python 3.11 remains fine for Linux and CPU-only setups.
On Ubuntu or Debian:
sudo apt update
sudo apt install -y git rsync unzip sqlite3 python3 python3-venv python3-dev build-essential
Check Python:
python3 --version
From the repository root:
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
Your shell prompt should now show (.venv).
On Windows PowerShell, the activation command is:
.\.venv\Scripts\Activate.ps1
For the Windows ROCm environment in this repository, prefer the helper script so the ROCm SDK target-family override is set before imports:
.\scripts\Activate-ROCm-Venv.ps1
For native Windows ROCm PyTorch on supported AMD GPUs, use Python 3.12 and AMD's ROCm 7.2 wheel set instead of generic pip install torch.
This repository includes a bootstrap script that creates a Python 3.12 virtual environment, installs the AMD ROCm SDK and PyTorch wheels, then installs the remaining project dependencies:
powershell -ExecutionPolicy Bypass -File .\scripts\setup_windows_rocm_venv.ps1
If you only need the Hugging Face and PyTorch workflows, and want to avoid installing the repo's optional TensorFlow/Keras path on Windows, use:
powershell -ExecutionPolicy Bypass -File .\scripts\setup_windows_rocm_venv.ps1 -SkipTensorFlow
Native Windows ROCm currently applies to the PyTorch path in this project. TensorFlow ROCm support is still Linux-oriented in AMD's documentation, so generic Keras/TensorFlow model paths on Windows should be treated as CPU-only unless you move those workflows to Linux or WSL.
For CPU-only use, install the requirements directly:
python -m pip install -r requirements.txt
For an NVIDIA CUDA or AMD ROCm Linux machine, install the PyTorch wheel recommended by the official PyTorch selector first:
# Choose the exact command for your OS, Python version, and GPU from:
# https://pytorch.org/get-started/locally/
python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
python -m pip install -r requirements.txt
The requirements.txt file uses normal package constraints for torch, torchvision, and torchaudio, so a compatible GPU build installed first should remain installed. On Windows ROCm, use .\scripts\setup_windows_rocm_venv.ps1 so the AMD ROCm wheels are installed before the shared requirements file.
Verify the key packages:
python - <<'PY'
import pandas
import torch
import transformers
import datasets
print("pandas", pandas.__version__)
print("torch", torch.__version__)
print("cuda_available", torch.cuda.is_available())
print("transformers", transformers.__version__)
print("datasets", datasets.__version__)
PY
Set these before running large jobs if you want caches and outputs on a fast disk with enough space:
export MLAAS_OUTDIR=/mnt/fast/mlaas-outputs
export HF_HOME=/mnt/fast/huggingface
export HF_DATASETS_CACHE=$HF_HOME/datasets
If you need private Hugging Face models or datasets, export a token:
export HF_TOKEN=<your-token>
For the service loop, you can also put the token in a repo-local .hf_token file. The manifest runner loads that file automatically before executing services. Environment variables still take precedence, so HF_TOKEN or HUGGING_FACE_HUB_TOKEN will override the file when set.
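The precedence described above can be sketched as follows. This is a minimal illustration, not the runner's actual loader: `resolve_hf_token` is a hypothetical helper, and checking HF_TOKEN before HUGGING_FACE_HUB_TOKEN is an assumption.

```python
import os
from pathlib import Path

# Assumption: HF_TOKEN is checked before HUGGING_FACE_HUB_TOKEN.
ENV_VARS = ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN")


def resolve_hf_token(env=None, token_file=".hf_token"):
    """Return a token from the environment, else from the file, else None.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    # Environment variables take precedence over the repo-local file.
    for name in ENV_VARS:
        value = env.get(name, "").strip()
        if value:
            return value
    path = Path(token_file)
    if path.is_file():
        # An empty file is treated the same as a missing one.
        return path.read_text(encoding="utf-8").strip() or None
    return None
```

Calling `resolve_hf_token()` from the repository root returns the environment value when one is set and falls back to the contents of .hf_token otherwise.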
Run commands from the repository root:
python -m mlaas_data_generator.cli.main <command> [options]
Commands:
| Command | Purpose |
|---|---|
| hf-manifest | Build reviewed service rows from the model and dataset registries. |
| run-manifest | Validate or execute reviewed service rows. |
Check the installed CLI:
python -m mlaas_data_generator.cli.main --help
python -m mlaas_data_generator.cli.main hf-manifest --help
python -m mlaas_data_generator.cli.main run-manifest --help
The manifest builder reads:
- mlaas_data_generator/registry/models.py
- mlaas_data_generator/registry/datasets.py
Start small on a new machine:
mkdir -p outputs
python -m mlaas_data_generator.cli.main hf-manifest \
--manifest-profile test \
--resource-tier light \
--task-keys text_classification,image_classification,tabular_regression \
--models-per-task 4 \
--datasets-per-model 1 \
--training-regimes finetune_transfer,inference_only \
--dataset-variants-per-pair 1 \
--split-variants-per-pair 1 \
--knob-variants-per-pair 2 \
--total-services 8 \
--output outputs/service_manifest.xlsx
This writes an Excel workbook with a services sheet and a defaults sheet.
By default, hf-manifest now uses a fresh random seed on each run, so repeated invocations produce slightly different reviewed manifests. Pass --seed <int> when you want a reproducible manifest.
Useful manifest profiles:
| Profile | Use case |
|---|---|
| test | Small smoke runs for a new machine. |
| balanced | Moderate sample sizes and runtime. |
| benchmark | Larger runs for stronger hardware. |
Common task keys include:
| Task key | Typical workload |
|---|---|
| text_classification | Text sequence classification. |
| token_classification | Named entity or token label tasks. |
| sentence_similarity | Pair scoring and similarity. |
| fill_mask | Masked language modelling. |
| text_generation | Causal language modelling. |
| text2text_generation | Summarisation and sequence-to-sequence generation. |
| image_classification | Image classification. |
| object_detection | Object detection. |
| image_segmentation | Segmentation. |
| image_captioning | Image-to-text generation. |
| text_image_retrieval | Image/text retrieval. |
| visual_question_answering | VQA. |
| tabular_regression | Generic tabular regression service rows. |
Open outputs/service_manifest.xlsx before executing it.
Important columns:
| Column | Purpose |
|---|---|
| enabled | Set to false to skip a row. Missing values default to enabled. |
| service_id | Primary service identifier. Missing values are generated deterministically. |
| case_name | Human-readable model/dataset/regime label. |
| dataset, dataset_name, dataset_config | Dataset source and provider identifiers. |
| model_type, hf_model_id, hf_task | Runner and model identifiers. |
| task_type, task, task_tag, modality | Functional compatibility attributes. |
| train_split, test_split, benchmark_split | Training and benchmark split names. |
| training_regime | finetune_transfer, inference_only, or generic. |
| resource_tier | Workload budget: light, medium, heavy, or stress_test. |
| training_epochs, batch_size, learning_rate, optimizer | Training and runtime knobs. |
| max_samples, max_length, timeout_s, max_train_time_s, max_eval_time_s, device | Workload and runtime controls. |
| input_schema, output_schema | Compatibility metadata for later composition work. |
For first runs on a new computer, reduce risk by keeping max_samples low, using --manifest-profile test, and setting enabled=false for rows you do not want to run yet.
--resource-tier controls model, dataset, and knob selection. If omitted, it follows the profile: test -> light, balanced -> medium, and benchmark -> heavy. Use stress_test only when you intentionally want the largest allowed services.
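The profile-to-tier fallback can be written as a small lookup. This is only a sketch of the rule stated above; `resolve_resource_tier` and the two constants are hypothetical names, not part of the project's CLI code.

```python
# Default tier per manifest profile, as described above.
PROFILE_DEFAULT_TIER = {"test": "light", "balanced": "medium", "benchmark": "heavy"}
VALID_TIERS = ("light", "medium", "heavy", "stress_test")


def resolve_resource_tier(profile, resource_tier=None):
    """Return the explicit tier if given, else the default for the profile."""
    if resource_tier:
        if resource_tier not in VALID_TIERS:
            raise ValueError(f"unknown resource tier: {resource_tier!r}")
        return resource_tier
    if profile not in PROFILE_DEFAULT_TIER:
        raise ValueError(f"unknown manifest profile: {profile!r}")
    return PROFILE_DEFAULT_TIER[profile]
```

Note that stress_test is never chosen by default in this sketch; it has to be requested explicitly.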
For GPU runs, leave device blank or set it to auto unless you need to force a device. PyTorch exposes ROCm devices through the torch.cuda API, so the runner will still resolve a supported AMD ROCm GPU as cuda.
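As a rough illustration of how a blank or auto device could be resolved, the sketch below relies only on `torch.cuda.is_available()`. The function name `resolve_device` is hypothetical, and the fallback to cpu when PyTorch is not installed is an assumption rather than documented runner behaviour.

```python
def resolve_device(device=None):
    """Map a manifest device value to a concrete device string.

    Hypothetical helper: blank or "auto" picks "cuda" when PyTorch reports
    an available GPU (NVIDIA CUDA or AMD ROCm), otherwise "cpu".
    """
    requested = (device or "").strip().lower() or "auto"
    if requested != "auto":
        # An explicit value such as "cpu" or "cuda:1" is passed through.
        return requested
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch there is nothing to run on but the CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

On a ROCm build, `torch.cuda.is_available()` returns True for a supported AMD GPU, which is why the resolved name is still cuda.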
On multi-GPU Linux machines, this project uses GPUs by pinning worker processes to individual GPUs. With grouped HF execution enabled, different Hugging Face model groups can be scheduled onto different GPUs while still reusing prepared models and datasets inside each group.
For a 2-GPU NVIDIA Linux VM, use:
python -m mlaas_data_generator.cli.main run-manifest \
--file outputs/service_manifest.xlsx \
--sheet services \
--db outputs/services.db \
--workers 2
--workers 2 starts two GPU-pinned worker processes. On grouped HF runs, different HF groups can run on different GPUs. On row-local runs, independent rows can run on different GPUs.
When GPU-parallel worker processes are used, each GPU writes to its own SQLite database file to avoid concurrent writes into the same SQLite database. For example, --db outputs/services.db --workers 2 will produce files such as outputs/services.gpu0.db and outputs/services.gpu1.db.
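Because results are then spread over several files, it can help to read them back together. The sketch below derives the per-GPU file names from the naming pattern in the example above and collects rows from each file's services table. Both helpers are hypothetical, and only the service_id and status columns are assumed.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def per_gpu_db_path(db_path, gpu_index):
    """outputs/services.db with index 0 becomes outputs/services.gpu0.db."""
    p = Path(db_path)
    return p.with_name(f"{p.stem}.gpu{gpu_index}{p.suffix}")


def load_services(db_path):
    """Collect (file name, service_id, status) from every per-GPU database."""
    base = Path(db_path)
    rows = []
    # Match services.gpu0.db, services.gpu1.db, ... next to the base path.
    for shard in sorted(base.parent.glob(f"{base.stem}.gpu*{base.suffix}")):
        with closing(sqlite3.connect(shard)) as conn:
            query = "select service_id, status from services"
            for service_id, status in conn.execute(query):
                rows.append((shard.name, service_id, status))
    return rows
```

For example, `load_services("outputs/services.db")` would scan outputs/ for the per-GPU files and return one tuple per stored service.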
During run-manifest, the CLI now shows a live manifest progress footer with completed rows out of total enabled rows plus current worker/GPU assignments. In multi-worker runs, progress is counted from completed manifest rows rather than submitted tasks, so grouped HF workers and parallel GPU processes still report accurate manifest-level completion.
Use --no-grouped-hf only when you want the most aggressive row-level parallelism and are willing to trade away grouped model/dataset reuse. Keeping grouped HF enabled is usually the better default when many rows share the same HF model.
CSV manifests can include a row with service_id=defaults. XLSX manifests can include a defaults sheet.
Dry-run validation does not train models. It normalizes column names, applies defaults, validates enabled rows, resolves missing service_id values, and writes outputs/service_manifest_results.csv.
python -m mlaas_data_generator.cli.main run-manifest \
--file outputs/service_manifest.xlsx \
--sheet services \
--dry-run
If validation fails, check:
- missing required columns such as dataset, model_type, or task_type
- invalid training_regime
- missing hf_model_id or hf_task for Hugging Face rows
- stale sheet names if you changed --sheet
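To pre-check a manifest by hand, the first two items in that list can be approximated with a few lines of Python. This is a sketch under stated assumptions, not the runner's validator: `is_enabled` and `validate_row` are hypothetical, the required-column list only covers the three columns named above, and a blank training_regime is left alone on the assumption that defaults may fill it in.

```python
REQUIRED_COLUMNS = ("dataset", "model_type", "task_type")
VALID_REGIMES = ("finetune_transfer", "inference_only", "generic")


def is_enabled(row):
    """Missing or blank values count as enabled; only "false" disables."""
    value = row.get("enabled")
    if value is None:
        return True
    return str(value).strip().lower() != "false"


def validate_row(row):
    """Return a list of problems for one manifest row (a plain dict)."""
    if not is_enabled(row):
        return []  # disabled rows are skipped, not validated
    problems = []
    for column in REQUIRED_COLUMNS:
        if not str(row.get(column) or "").strip():
            problems.append(f"missing {column}")
    regime = str(row.get("training_regime") or "").strip()
    # Assumption: only non-blank, unrecognised values are flagged.
    if regime and regime not in VALID_REGIMES:
        problems.append(f"invalid training_regime: {regime!r}")
    return problems
```

Rows read with `csv.DictReader` can be passed to `validate_row` directly, since each row is already a dict keyed by column name.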
After the dry run succeeds, execute the enabled service rows:
python -m mlaas_data_generator.cli.main run-manifest \
--file outputs/service_manifest.xlsx \
--sheet services \
--db outputs/services.db
For the 2-GPU Linux VM shown above, a good default is:
python -m mlaas_data_generator.cli.main run-manifest \
--file outputs/service_manifest.xlsx \
--sheet services \
--db outputs/services.db \
--workers 2
If you explicitly want to maximize row-level spreading instead of grouped reuse, use:
python -m mlaas_data_generator.cli.main run-manifest \
--file outputs/service_manifest.xlsx \
--sheet services \
--db outputs/services.db \
--workers 2 \
--no-grouped-hf
To confirm both GPUs are active while the run is in progress:
watch -n 1 nvidia-smi
Or capture a compact view:
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv -l 1
The run writes:
| Path | Contents |
|---|---|
| outputs/services.db | SQLite database containing service records and metrics. |
| outputs/services.gpu0.db, outputs/services.gpu1.db, ... | Per-GPU SQLite databases created automatically during GPU-parallel runs. |
| outputs/service_manifest_results.csv | Per-row success/failure summary. |
| outputs/service_manifest_failed.csv | Retry manifest containing only failed row-level runs from the latest manifest execution. |
| outputs/service_failures.log | Detailed validation or runtime failures. |
Successful rows are written to the SQLite database configured by CONFIG["db_path"], MLAAS_DB_PATH, MLAAS_SQL_DB_PATH, or the --db override.
The active schema is service-only:
| Table | Contents |
|---|---|
| services | One row per manifest service instance. |
| service_metrics | Typed quality, QoS, latency, runtime, resource, cost, reliability, explainability, and metadata metrics. |
| service_artifacts | Optional model, report, or output artifact references. |
| service_split_provenance | Optional split and distribution provenance. |
| service_failures | Validation and execution failure details. |
There are no active federated workflow or model-averaging tables.
Use SQLite directly:
sqlite3 outputs/services.db ".tables"
sqlite3 outputs/services.db "select service_id, status, task_type, training_regime from services limit 10;"
Or load results in Python:
python - <<'PY'
import sqlite3
import pandas as pd
conn = sqlite3.connect("outputs/services.db")
df = pd.read_sql_query("select * from services limit 10", conn)
print(df)
PY
After the smoke run works:
- Increase --total-services.
- Move from --manifest-profile test / --resource-tier light to balanced / medium or benchmark / heavy.
- Add more --task-keys.
- Increase --models-per-task or --datasets-per-model.
- Increase max_samples in the manifest or use --avg-sample-size.
Example larger manifest:
python -m mlaas_data_generator.cli.main hf-manifest \
--manifest-profile balanced \
--resource-tier medium \
--task-keys text_classification,token_classification,sentence_similarity,image_classification,object_detection \
--models-per-task 8 \
--datasets-per-model 2 \
--training-regimes finetune_transfer,inference_only \
--dataset-variants-per-pair 1 \
--split-variants-per-pair 1 \
--knob-variants-per-pair 2 \
--total-services 40 \
--output outputs/service_manifest_balanced.xlsx
Validate it:
python -m mlaas_data_generator.cli.main run-manifest \
--file outputs/service_manifest_balanced.xlsx \
--sheet services \
--dry-run
Run it:
python -m mlaas_data_generator.cli.main run-manifest \
--file outputs/service_manifest_balanced.xlsx \
--sheet services \
--db outputs/services_balanced.db
To run the test suite, install the requirements first, then run:
python -m pytest mlaas_data_generator/test
Focused checks:
python -m pytest \
mlaas_data_generator/test/test_service_manifest_pipeline.py \
mlaas_data_generator/test/test_service_storage.py \
mlaas_data_generator/test/test_service_runner.py
If torch.cuda.is_available() is False, check the installed PyTorch build first. On NVIDIA, verify the CUDA install you selected. On AMD Windows ROCm, verify that you used .\scripts\setup_windows_rocm_venv.ps1, that Python is 3.12, and that the installed wheel version matches AMD's current Windows ROCm support matrix.
If a Hugging Face dataset or model fails to download, check internet access, disk space, HF_HOME, HF_DATASETS_CACHE, and whether the model or dataset requires HF_TOKEN.
If Excel output fails, confirm openpyxl is installed in the active virtual environment:
python -m pip show openpyxl
If a run fails partway through, inspect:
tail -n 80 outputs/service_failures.log
python - <<'PY'
import sqlite3
import pandas as pd
conn = sqlite3.connect("outputs/services.db")
print(pd.read_sql_query("select * from service_failures order by failure_id desc limit 10", conn))
PY
To rerun only failed rows after fixing the cause, use:
python -m mlaas_data_generator.cli.main run-manifest \
--file outputs/service_manifest_failed.csv \
--db outputs/services.db
- Add HF models in mlaas_data_generator/registry/models.py.
- Add HF datasets in mlaas_data_generator/registry/datasets.py.
- Keep new execution behavior row-local: one manifest row produces one independent service record.
- Add future composition logic in a separate layer that reads the service table; do not couple composition to service generation.