MLaaS Service Dataset Generator

This project generates comparable Machine Learning as a Service records from reviewed Hugging Face service manifests. Each manifest row defines an independent model, dataset, task, and training regime. The runner validates the row, trains or loads the model, evaluates it on a benchmark split, records functional attributes and system metrics, and writes the resulting service data to a SQLite database.

The active workflow is:

registry -> hf-manifest -> review manifest -> run-manifest --dry-run -> run-manifest -> SQLite service records

Each manifest row describes one independent service instance, and executing it stores exactly one service record in SQLite.

Python And Platform Prerequisites

Python 3.12 is the recommended baseline for current Windows ROCm PyTorch environments. Python 3.11 remains fine for Linux and CPU-only setups.

On Ubuntu or Debian:

sudo apt update
sudo apt install -y git rsync unzip sqlite3 python3 python3-venv python3-dev build-essential

Check Python:

python3 --version

Create And Activate A Virtual Environment

From the repository root:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel

Your shell prompt should now show (.venv).

On Windows PowerShell, the activation command is:

.\.venv\Scripts\Activate.ps1

For the Windows ROCm environment in this repository, prefer the helper script so the ROCm SDK target-family override is set before imports:

.\scripts\Activate-ROCm-Venv.ps1

Windows ROCm (AMD Radeon) Setup

For native Windows ROCm PyTorch on supported AMD GPUs, use Python 3.12 and AMD's ROCm 7.2 wheel set instead of generic pip install torch.

This repository includes a bootstrap script that creates a Python 3.12 virtual environment, installs the AMD ROCm SDK and PyTorch wheels, then installs the remaining project dependencies:

powershell -ExecutionPolicy Bypass -File .\scripts\setup_windows_rocm_venv.ps1

If you only need the Hugging Face and PyTorch workflows, and want to avoid installing the repo's optional TensorFlow/Keras path on Windows, use:

powershell -ExecutionPolicy Bypass -File .\scripts\setup_windows_rocm_venv.ps1 -SkipTensorFlow

Native Windows ROCm currently applies to the PyTorch path in this project. TensorFlow ROCm support is still Linux-oriented in AMD's documentation, so generic Keras/TensorFlow model paths on Windows should be treated as CPU-only unless you move those workflows to Linux or WSL.

Install Dependencies

For CPU-only use, install the requirements directly:

python -m pip install -r requirements.txt

For an NVIDIA CUDA or AMD ROCm Linux machine, first install the PyTorch wheel recommended by the official PyTorch selector. The command below uses the CUDA 12.8 wheel index as an example:

# Choose the exact command for your OS, Python version, and GPU from:
# https://pytorch.org/get-started/locally/
python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
python -m pip install -r requirements.txt

The requirements.txt file uses normal package constraints for torch, torchvision, and torchaudio, so a compatible GPU build installed first should remain installed. On Windows ROCm, use .\scripts\setup_windows_rocm_venv.ps1 so the AMD ROCm wheels are installed before the shared requirements file.

Verify the key packages:

python - <<'PY'
import pandas
import torch
import transformers
import datasets

print("pandas", pandas.__version__)
print("torch", torch.__version__)
print("cuda_available", torch.cuda.is_available())
print("transformers", transformers.__version__)
print("datasets", datasets.__version__)
PY

Optional Environment Variables

Set these before running large jobs if you want caches and outputs on a fast disk with enough space:

export MLAAS_OUTDIR=/mnt/fast/mlaas-outputs
export HF_HOME=/mnt/fast/huggingface
export HF_DATASETS_CACHE=$HF_HOME/datasets

If you need private Hugging Face models or datasets, export a token:

export HF_TOKEN="<your-token>"

For the service loop, you can also put the token in a repo-local .hf_token file. The manifest runner loads that file automatically before executing services. Environment variables still take precedence, so HF_TOKEN or HUGGING_FACE_HUB_TOKEN will override the file when set.
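
The precedence described above can be sketched as follows. This is a minimal illustration, not the runner's actual code: the function name resolve_hf_token is hypothetical, and checking HF_TOKEN before HUGGING_FACE_HUB_TOKEN is an assumption.

```python
import os
from pathlib import Path


def resolve_hf_token(repo_root=".", environ=None):
    """Return a Hugging Face token, preferring environment variables over the file.

    Hypothetical helper for illustration only. The order in which the two
    environment variables are checked is an assumption.
    """
    env = os.environ if environ is None else environ
    for name in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"):
        value = env.get(name, "").strip()
        if value:
            return value
    # Fall back to the repo-local token file when no variable is set.
    token_file = Path(repo_root) / ".hf_token"
    if token_file.is_file():
        return token_file.read_text(encoding="utf-8").strip() or None
    return None
```

Passing a dictionary as environ makes the lookup easy to exercise without touching the real environment.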

CLI

Run commands from the repository root:

python -m mlaas_data_generator.cli.main <command> [options]

Commands:

Command	Purpose
`hf-manifest`	Build reviewed service rows from the model and dataset registries.
`run-manifest`	Validate or execute reviewed service rows.

Check the installed CLI:

python -m mlaas_data_generator.cli.main --help
python -m mlaas_data_generator.cli.main hf-manifest --help
python -m mlaas_data_generator.cli.main run-manifest --help

Build A Manifest

The manifest builder reads:

mlaas_data_generator/registry/models.py
mlaas_data_generator/registry/datasets.py

Start small on a new machine:

mkdir -p outputs

python -m mlaas_data_generator.cli.main hf-manifest \
  --manifest-profile test \
  --resource-tier light \
  --task-keys text_classification,image_classification,tabular_regression \
  --models-per-task 4 \
  --datasets-per-model 1 \
  --training-regimes finetune_transfer,inference_only \
  --dataset-variants-per-pair 1 \
  --split-variants-per-pair 1 \
  --knob-variants-per-pair 2 \
  --total-services 8 \
  --output outputs/service_manifest.xlsx

This writes an Excel workbook with a services sheet and a defaults sheet. By default, hf-manifest uses a fresh random seed on each run, so repeated invocations can produce different manifests. Pass --seed <int> when you want a reproducible manifest.

Useful manifest profiles:

Profile	Use case
`test`	Small smoke runs for a new machine.
`balanced`	Moderate sample sizes and runtime.
`benchmark`	Larger runs for stronger hardware.

Common task keys include:

Task key	Typical workload
`text_classification`	Text sequence classification.
`token_classification`	Named entity or token label tasks.
`sentence_similarity`	Pair scoring and similarity.
`fill_mask`	Masked language modelling.
`text_generation`	Causal language modelling.
`text2text_generation`	Summarisation and sequence-to-sequence generation.
`image_classification`	Image classification.
`object_detection`	Object detection.
`image_segmentation`	Segmentation.
`image_captioning`	Image-to-text generation.
`text_image_retrieval`	Image/text retrieval.
`visual_question_answering`	VQA.
`tabular_regression`	Generic tabular regression service rows.

Review The Manifest

Open outputs/service_manifest.xlsx before executing it.

Important columns:

Column	Purpose
`enabled`	Set to `false` to skip a row. Missing values default to enabled.
`service_id`	Primary service identifier. Missing values are generated deterministically.
`case_name`	Human-readable model/dataset/regime label.
`dataset`, `dataset_name`, `dataset_config`	Dataset source and provider identifiers.
`model_type`, `hf_model_id`, `hf_task`	Runner and model identifiers.
`task_type`, `task`, `task_tag`, `modality`	Functional compatibility attributes.
`train_split`, `test_split`, `benchmark_split`	Training and benchmark split names.
`training_regime`	`finetune_transfer`, `inference_only`, or `generic`.
`resource_tier`	Workload budget: `light`, `medium`, `heavy`, or `stress_test`.
`training_epochs`, `batch_size`, `learning_rate`, `optimizer`	Training and runtime knobs.
`max_samples`, `max_length`, `timeout_s`, `max_train_time_s`, `max_eval_time_s`, `device`	Workload and runtime controls.
`input_schema`, `output_schema`	Compatibility metadata for later composition work.

For first runs on a new computer, reduce risk by keeping max_samples low, using --manifest-profile test, and setting enabled=false for rows you do not want to run yet.

--resource-tier controls model, dataset, and knob selection. If omitted, it follows the profile: test -> light, balanced -> medium, and benchmark -> heavy. Use stress_test only when you intentionally want the largest allowed services.
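
The profile-to-tier fallback can be written down as a small lookup. The mapping and tier names come from the text above; the helper name effective_resource_tier is hypothetical and only illustrates the rule.

```python
# Default tier per profile, as listed above.
PROFILE_DEFAULT_TIER = {"test": "light", "balanced": "medium", "benchmark": "heavy"}
VALID_TIERS = ("light", "medium", "heavy", "stress_test")


def effective_resource_tier(profile, resource_tier=None):
    """Return the explicit tier if one is given, else the profile's default."""
    tier = resource_tier or PROFILE_DEFAULT_TIER[profile]
    if tier not in VALID_TIERS:
        raise ValueError(f"unknown resource tier: {tier!r}")
    return tier


print(effective_resource_tier("test"))                      # light
print(effective_resource_tier("benchmark", "stress_test"))  # stress_test
```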

For GPU runs, leave device blank or set it to auto unless you need to force a device. PyTorch exposes ROCm devices through the torch.cuda API, so the runner will still resolve a supported AMD ROCm GPU as cuda.
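
A rough sketch of this device resolution is shown below. The helper names resolve_device and gpu_backend are hypothetical, and the runner's real logic may differ. The only PyTorch calls used are torch.cuda.is_available() and the torch.version.hip / torch.version.cuda attributes.

```python
try:
    import torch
except ImportError:  # keep the sketch importable on machines without PyTorch
    torch = None


def resolve_device(requested=None):
    """Map a manifest `device` value to a torch device string.

    Hypothetical helper: blank or "auto" picks "cuda" when a GPU is visible,
    otherwise "cpu". Any other value is returned unchanged (lower-cased).
    """
    value = (requested or "auto").strip().lower()
    if value != "auto":
        return value
    if torch is not None and torch.cuda.is_available():
        return "cuda"  # ROCm builds of PyTorch also report through torch.cuda
    return "cpu"


def gpu_backend():
    """Report which GPU backend the installed PyTorch build targets, if any."""
    if torch is None:
        return None
    if getattr(torch.version, "hip", None):
        return "rocm"
    if getattr(torch.version, "cuda", None):
        return "cuda"
    return None
```

On a ROCm build, gpu_backend() would be expected to return "rocm" while resolve_device() still returns "cuda".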

On multi-GPU Linux machines, the runner pins each worker process to a single GPU. With grouped HF execution enabled, different Hugging Face model groups can be scheduled onto different GPUs while still reusing prepared models and datasets inside each group.

For a 2-GPU NVIDIA Linux VM, use:

python -m mlaas_data_generator.cli.main run-manifest \
  --file outputs/service_manifest.xlsx \
  --sheet services \
  --db outputs/services.db \
  --workers 2

--workers 2 starts two GPU-pinned worker processes. On grouped HF runs, different HF groups can run on different GPUs. On row-local runs, independent rows can run on different GPUs.

When GPU-parallel worker processes are used, each GPU writes to its own SQLite database file to avoid concurrent writes into the same SQLite database. For example, --db outputs/services.db --workers 2 will produce files such as outputs/services.gpu0.db and outputs/services.gpu1.db.
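
The naming pattern in that example can be reproduced with pathlib. This is only a sketch that matches the file names quoted above. The helper per_gpu_db_paths is hypothetical, and the runner's own naming code is not shown in this document.

```python
from pathlib import Path


def per_gpu_db_paths(db_path, workers):
    """Derive one database path per GPU index, e.g. services.gpu0.db."""
    base = Path(db_path)
    return [base.with_name(f"{base.stem}.gpu{i}{base.suffix}") for i in range(workers)]


for path in per_gpu_db_paths("outputs/services.db", 2):
    print(path.as_posix())
# outputs/services.gpu0.db
# outputs/services.gpu1.db
```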

During run-manifest, the CLI shows a live progress footer with the number of completed rows out of total enabled rows, plus the current worker/GPU assignments. In multi-worker runs, progress is counted from completed manifest rows rather than submitted tasks, so grouped HF workers and parallel GPU processes still report accurate manifest-level completion.

Use --no-grouped-hf only when you want the most aggressive row-level parallelism and are willing to trade away grouped model/dataset reuse. Keeping grouped HF enabled is usually the better default when many rows share the same HF model.
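
To make the trade-off concrete, here is a toy placement function. It is an illustration under assumptions, not the project's scheduler: the function assign_rows, the round-robin policy, and the use of hf_model_id as the grouping key are all hypothetical, and the model names are placeholders.

```python
from collections import defaultdict


def assign_rows(rows, workers, grouped=True):
    """Return {worker_index: [service_id, ...]} using round-robin placement.

    Hypothetical scheduler for illustration. With grouped=True, rows that share
    an hf_model_id stay on one worker so a loaded model could be reused.
    """
    if grouped:
        groups = defaultdict(list)
        for row in rows:
            groups[row["hf_model_id"]].append(row["service_id"])
        units = list(groups.values())
    else:
        units = [[row["service_id"]] for row in rows]

    plan = {w: [] for w in range(workers)}
    for i, unit in enumerate(units):
        plan[i % workers].extend(unit)
    return plan


rows = [
    {"service_id": "s1", "hf_model_id": "model-a"},
    {"service_id": "s2", "hf_model_id": "model-a"},
    {"service_id": "s3", "hf_model_id": "model-a"},
    {"service_id": "s4", "hf_model_id": "model-b"},
]
print(assign_rows(rows, workers=2, grouped=True))   # {0: ['s1', 's2', 's3'], 1: ['s4']}
print(assign_rows(rows, workers=2, grouped=False))  # {0: ['s1', 's3'], 1: ['s2', 's4']}
```

In the grouped plan one worker handles every row for model-a, so the load is uneven but the model is prepared once. In the row-local plan the load is even, but both workers would need to prepare model-a.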

CSV manifests can include a row with service_id=defaults. XLSX manifests can include a defaults sheet.

Validate A Manifest

Dry-run validation does not train models. It normalizes column names, applies defaults, validates enabled rows, resolves missing service_id values, and writes outputs/service_manifest_results.csv.

python -m mlaas_data_generator.cli.main run-manifest \
  --file outputs/service_manifest.xlsx \
  --sheet services \
  --dry-run

If validation fails, check:

missing required columns such as dataset, model_type, or task_type
invalid training_regime
missing hf_model_id or hf_task for Hugging Face rows
stale sheet names if you changed --sheet
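
The dry-run steps in this section (normalize names, apply defaults, validate enabled rows, fill in service_id) can be sketched in plain Python. Everything here is an assumption made for illustration: the helper names, the set of values treated as "disabled", and the hashing scheme for generated IDs are hypothetical. The Hugging Face specific checks are omitted because the document does not say how HF rows are identified.

```python
import hashlib

REQUIRED = ("dataset", "model_type", "task_type")
REGIMES = ("finetune_transfer", "inference_only", "generic")


def normalize(row):
    """Lower-case column names and strip whitespace from names and values."""
    return {str(k).strip().lower(): str(v).strip() for k, v in row.items() if v is not None}


def is_enabled(row):
    """Missing `enabled` values count as enabled."""
    return row.get("enabled", "").lower() not in ("false", "0", "no")


def stable_service_id(row):
    """Derive an id from the row content so reruns give the same value (assumed scheme)."""
    key = "|".join(f"{k}={row[k]}" for k in sorted(row) if k != "service_id")
    return "svc_" + hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]


def validate(rows, defaults=None):
    """Return (prepared_rows, errors) without training anything."""
    prepared, errors = [], []
    base = normalize(defaults or {})
    for index, raw in enumerate(rows):
        # Row values win over defaults; empty cells fall back to the default.
        row = {**base, **{k: v for k, v in normalize(raw).items() if v}}
        if not is_enabled(row):
            continue
        problems = [f"missing {c}" for c in REQUIRED if not row.get(c)]
        if row.get("training_regime") not in REGIMES:
            problems.append("invalid training_regime")
        if problems:
            errors.append((index, problems))
            continue
        row["service_id"] = row.get("service_id") or stable_service_id(row)
        prepared.append(row)
    return prepared, errors
```

Returning the errors alongside the prepared rows mirrors the idea of a per-row results file: one bad row does not stop the others from being checked.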

Run The Program

After the dry run succeeds, execute the enabled service rows:

python -m mlaas_data_generator.cli.main run-manifest \
  --file outputs/service_manifest.xlsx \
  --sheet services \
  --db outputs/services.db

For the 2-GPU Linux VM shown above, a good default is:

python -m mlaas_data_generator.cli.main run-manifest \
  --file outputs/service_manifest.xlsx \
  --sheet services \
  --db outputs/services.db \
  --workers 2

If you explicitly want to maximize row-level spreading instead of grouped reuse, use:

python -m mlaas_data_generator.cli.main run-manifest \
  --file outputs/service_manifest.xlsx \
  --sheet services \
  --db outputs/services.db \
  --workers 2 \
  --no-grouped-hf

To confirm both GPUs are active while the run is in progress:

watch -n 1 nvidia-smi

Or capture a compact view:

nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv -l 1

The run writes:

Path	Contents
`outputs/services.db`	SQLite database containing service records and metrics.
`outputs/services.gpu0.db`, `outputs/services.gpu1.db`, ...	Per-GPU SQLite databases created automatically during GPU-parallel runs.
`outputs/service_manifest_results.csv`	Per-row success/failure summary.
`outputs/service_manifest_failed.csv`	Retry manifest containing only failed row-level runs from the latest manifest execution.
`outputs/service_failures.log`	Detailed validation or runtime failures.

Successful rows are written to the SQLite database configured by CONFIG["db_path"], MLAAS_DB_PATH, MLAAS_SQL_DB_PATH, or the --db override.

Database Tables

The active schema is service-only:

Table	Contents
`services`	One row per manifest service instance.
`service_metrics`	Typed quality, QoS, latency, runtime, resource, cost, reliability, explainability, and metadata metrics.
`service_artifacts`	Optional model, report, or output artifact references.
`service_split_provenance`	Optional split and distribution provenance.
`service_failures`	Validation and execution failure details.

There are no active federated workflow or model-averaging tables.
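
Because the column layout of these tables is not listed here, a quick way to explore it is to ask SQLite itself. The sketch below uses only the standard sqlite3 module; the helper name describe_schema is hypothetical.

```python
import sqlite3


def describe_schema(db_path):
    """Return {table_name: [column_name, ...]} for every table in the database."""
    conn = sqlite3.connect(db_path)
    try:
        tables = [
            r[0]
            for r in conn.execute(
                "select name from sqlite_master where type = 'table' order by name"
            )
        ]
        # pragma table_info yields (cid, name, type, notnull, dflt_value, pk).
        return {
            t: [col[1] for col in conn.execute(f'pragma table_info("{t}")')]
            for t in tables
        }
    finally:
        conn.close()
```

Calling describe_schema("outputs/services.db") after a run should list the tables above together with their columns.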

Query Results

Use SQLite directly:

sqlite3 outputs/services.db ".tables"
sqlite3 outputs/services.db "select service_id, status, task_type, training_regime from services limit 10;"

Or load results in Python:

python - <<'PY'
import sqlite3
import pandas as pd

conn = sqlite3.connect("outputs/services.db")
df = pd.read_sql_query("select * from services limit 10", conn)
print(df)
PY
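
After a GPU-parallel run, the records are spread over several per-GPU files. One possible way to read them together, using only the standard library, is sketched below. The helper load_services and the added source_db field are hypothetical, and the sketch assumes every matched file contains a services table.

```python
import glob
import sqlite3


def load_services(pattern, limit=None):
    """Collect rows from the `services` table of every database matching `pattern`."""
    records = []
    for path in sorted(glob.glob(pattern)):
        conn = sqlite3.connect(path)
        conn.row_factory = sqlite3.Row  # lets each row convert cleanly to a dict
        try:
            for row in conn.execute("select * from services"):
                record = dict(row)
                record["source_db"] = path  # remember which file the row came from
                records.append(record)
        finally:
            conn.close()
    return records if limit is None else records[:limit]
```

For example, load_services("outputs/services.gpu*.db") would gather the rows from outputs/services.gpu0.db, outputs/services.gpu1.db, and so on.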

Scaling Up On A More Powerful Machine

After the smoke run works:

Increase --total-services.
Move from --manifest-profile test / --resource-tier light to balanced / medium or benchmark / heavy.
Add more --task-keys.
Increase --models-per-task or --datasets-per-model.
Increase max_samples in the manifest or use --avg-sample-size.

Example larger manifest:

python -m mlaas_data_generator.cli.main hf-manifest \
  --manifest-profile balanced \
  --resource-tier medium \
  --task-keys text_classification,token_classification,sentence_similarity,image_classification,object_detection \
  --models-per-task 8 \
  --datasets-per-model 2 \
  --training-regimes finetune_transfer,inference_only \
  --dataset-variants-per-pair 1 \
  --split-variants-per-pair 1 \
  --knob-variants-per-pair 2 \
  --total-services 40 \
  --output outputs/service_manifest_balanced.xlsx

Validate it:

python -m mlaas_data_generator.cli.main run-manifest \
  --file outputs/service_manifest_balanced.xlsx \
  --sheet services \
  --dry-run

Run it:

python -m mlaas_data_generator.cli.main run-manifest \
  --file outputs/service_manifest_balanced.xlsx \
  --sheet services \
  --db outputs/services_balanced.db

Tests

Install requirements first, then run:

python -m pytest mlaas_data_generator/test

Focused checks:

python -m pytest \
  mlaas_data_generator/test/test_service_manifest_pipeline.py \
  mlaas_data_generator/test/test_service_storage.py \
  mlaas_data_generator/test/test_service_runner.py

Troubleshooting

If torch.cuda.is_available() is False, check the installed PyTorch build first. On NVIDIA, verify the CUDA install you selected. On AMD Windows ROCm, verify that you used .\scripts\setup_windows_rocm_venv.ps1, that Python is 3.12, and that the installed wheel version matches AMD's current Windows ROCm support matrix.

If a Hugging Face dataset or model fails to download, check internet access, disk space, HF_HOME, HF_DATASETS_CACHE, and whether the model or dataset requires HF_TOKEN.
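
A small diagnostic can gather most of those checks in one place. This is a sketch using only the standard library; the function cache_report is hypothetical, and it reports free space on the nearest existing parent directory when a configured path has not been created yet.

```python
import os
import shutil
from pathlib import Path


def cache_report(environ=None):
    """Summarize Hugging Face cache settings and free space on their disks."""
    env = os.environ if environ is None else environ
    report = {"token_set": bool(env.get("HF_TOKEN") or env.get("HUGGING_FACE_HUB_TOKEN"))}
    for name in ("HF_HOME", "HF_DATASETS_CACHE", "MLAAS_OUTDIR"):
        value = env.get(name)
        if not value:
            report[name] = None
            continue
        # Walk up to the nearest existing directory so disk_usage does not fail.
        probe = Path(value).expanduser().resolve()
        while not probe.exists() and probe != probe.parent:
            probe = probe.parent
        free_gib = shutil.disk_usage(probe).free / 2**30
        report[name] = {"path": value, "free_gib": round(free_gib, 1)}
    return report
```

Printing cache_report() before a large job shows at a glance whether a token is set and how much room each cache location has.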

If Excel output fails, confirm openpyxl is installed in the active virtual environment:

python -m pip show openpyxl

If a run fails partway through, inspect:

tail -n 80 outputs/service_failures.log
python - <<'PY'
import sqlite3
import pandas as pd

conn = sqlite3.connect("outputs/services.db")
print(pd.read_sql_query("select * from service_failures order by failure_id desc limit 10", conn))
PY

To rerun only failed rows after fixing the cause, use:

python -m mlaas_data_generator.cli.main run-manifest \
  --file outputs/service_manifest_failed.csv \
  --db outputs/services.db

Extending

Add HF models in mlaas_data_generator/registry/models.py.
Add HF datasets in mlaas_data_generator/registry/datasets.py.
Keep new execution behavior row-local: one manifest row produces one independent service record.
Add future composition logic in a separate layer that reads the service table; do not couple composition to service generation.
