YGYerrd/MLaaS_Dataset_Gen

MLaaS Service Dataset Generator

This project generates comparable Machine Learning as a Service records from reviewed Hugging Face service manifests. Each manifest row defines an independent model, dataset, task, and training regime. The runner validates the row, trains or loads the model, evaluates it on a benchmark split, records functional attributes and system metrics, and writes the resulting service data to a SQLite database.

The active workflow is:

registry -> hf-manifest -> review manifest -> run-manifest --dry-run -> run-manifest -> SQLite service records

Each manifest row describes one independent service instance, so executing a row produces exactly one service record in SQLite.

Python And Platform Prerequisites

Python 3.12 is the recommended baseline for current Windows ROCm PyTorch environments. Python 3.11 remains fine for Linux and CPU-only setups.

On Ubuntu or Debian:

sudo apt update
sudo apt install -y git rsync unzip sqlite3 python3 python3-venv python3-dev build-essential

Check Python:

python3 --version

Create And Activate A Virtual Environment

From the repository root:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel

Your shell prompt should now show (.venv).

On Windows PowerShell, the activation command is:

.\.venv\Scripts\Activate.ps1

For the Windows ROCm environment in this repository, prefer the helper script so the ROCm SDK target-family override is set before imports:

.\scripts\Activate-ROCm-Venv.ps1

Windows ROCm (AMD Radeon) Setup

For native Windows ROCm PyTorch on supported AMD GPUs, use Python 3.12 and AMD's ROCm 7.2 wheel set instead of a generic pip install torch.

This repository includes a bootstrap script that creates a Python 3.12 virtual environment, installs the AMD ROCm SDK and PyTorch wheels, then installs the remaining project dependencies:

powershell -ExecutionPolicy Bypass -File .\scripts\setup_windows_rocm_venv.ps1

If you only need the Hugging Face and PyTorch workflows, and want to avoid installing the repo's optional TensorFlow/Keras path on Windows, use:

powershell -ExecutionPolicy Bypass -File .\scripts\setup_windows_rocm_venv.ps1 -SkipTensorFlow

Native Windows ROCm currently applies to the PyTorch path in this project. TensorFlow ROCm support is still Linux-oriented in AMD's documentation, so generic Keras/TensorFlow model paths on Windows should be treated as CPU-only unless you move those workflows to Linux or WSL.

Install Dependencies

For CPU-only use, install the requirements directly:

python -m pip install -r requirements.txt

For an NVIDIA CUDA or AMD ROCm Linux machine, first install the PyTorch wheel recommended by the official PyTorch selector:

# Choose the exact command for your OS, Python version, and GPU from:
# https://pytorch.org/get-started/locally/
# The cu128 index below is only an example; it selects CUDA 12.8 builds.
python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
python -m pip install -r requirements.txt

The requirements.txt file uses normal package constraints for torch, torchvision, and torchaudio, so a compatible GPU build installed first should remain installed. On Windows ROCm, use .\scripts\setup_windows_rocm_venv.ps1 so the AMD ROCm wheels are installed before the shared requirements file.

Verify the key packages:

python - <<'PY'
import pandas
import torch
import transformers
import datasets

print("pandas", pandas.__version__)
print("torch", torch.__version__)
print("cuda_available", torch.cuda.is_available())
print("transformers", transformers.__version__)
print("datasets", datasets.__version__)
PY

Optional Environment Variables

Set these before running large jobs if you want caches and outputs on a fast disk with enough space:

export MLAAS_OUTDIR=/mnt/fast/mlaas-outputs
export HF_HOME=/mnt/fast/huggingface
export HF_DATASETS_CACHE=$HF_HOME/datasets

If you need private Hugging Face models or datasets, export a token:

export HF_TOKEN=<your-token>

For the service loop, you can also put the token in a repo-local .hf_token file. The manifest runner loads that file automatically before executing services. Environment variables still take precedence, so HF_TOKEN or HUGGING_FACE_HUB_TOKEN will override the file when set.
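The precedence described above can be sketched as follows. This is a minimal illustration, not the runner's actual loader: the function name `resolve_hf_token` is hypothetical, and checking HF_TOKEN before HUGGING_FACE_HUB_TOKEN is an assumption, since the text only says that either variable overrides the file.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Variable names come from this README; the order between them is an assumption.
TOKEN_ENV_VARS = ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN")


def resolve_hf_token(repo_root: Path, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the token from the environment, else from .hf_token, else None."""
    env = os.environ if env is None else env
    for name in TOKEN_ENV_VARS:
        value = env.get(name, "").strip()
        if value:
            return value  # environment variables win over the file
    token_file = repo_root / ".hf_token"
    if token_file.is_file():
        # An empty file is treated the same as a missing one.
        return token_file.read_text(encoding="utf-8").strip() or None
    return None


if __name__ == "__main__":
    token = resolve_hf_token(Path.cwd())
    print("token found" if token else "no token configured")
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.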

CLI

Run commands from the repository root:

python -m mlaas_data_generator.cli.main <command> [options]

Commands:

Command | Purpose
hf-manifest | Build reviewed service rows from the model and dataset registries.
run-manifest | Validate or execute reviewed service rows.

Check the installed CLI:

python -m mlaas_data_generator.cli.main --help
python -m mlaas_data_generator.cli.main hf-manifest --help
python -m mlaas_data_generator.cli.main run-manifest --help

Build A Manifest

The manifest builder reads:

  • mlaas_data_generator/registry/models.py
  • mlaas_data_generator/registry/datasets.py

Start small on a new machine:

mkdir -p outputs

python -m mlaas_data_generator.cli.main hf-manifest \
  --manifest-profile test \
  --resource-tier light \
  --task-keys text_classification,image_classification,tabular_regression \
  --models-per-task 4 \
  --datasets-per-model 1 \
  --training-regimes finetune_transfer,inference_only \
  --dataset-variants-per-pair 1 \
  --split-variants-per-pair 1 \
  --knob-variants-per-pair 2 \
  --total-services 8 \
  --output outputs/service_manifest.xlsx

This writes an Excel workbook with a services sheet and a defaults sheet. By default, hf-manifest uses a fresh random seed on each run, so repeated invocations produce slightly different manifests for review. Pass --seed <int> when you want a reproducible manifest.
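The effect of a fixed seed can be illustrated with a small standalone sketch. It does not reproduce how hf-manifest selects rows; `sample_rows` and the candidate names are hypothetical and only show why an explicit seed makes a random selection repeatable.

```python
import random
from typing import Optional, Sequence


def sample_rows(candidates: Sequence[str], total: int, seed: Optional[int] = None) -> list[str]:
    """Pick up to `total` candidates; a fixed seed makes the pick repeatable."""
    rng = random.Random(seed)  # seed=None draws fresh entropy on every call
    return rng.sample(list(candidates), k=min(total, len(candidates)))


# Hypothetical model/dataset pairs standing in for registry entries.
candidates = [f"model_{i}:dataset_{i % 3}" for i in range(12)]

# Same seed, same selection.
print(sample_rows(candidates, 4, seed=42) == sample_rows(candidates, 4, seed=42))  # True

# No seed: the selection may differ from run to run.
print(sample_rows(candidates, 4))
```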

Useful manifest profiles:

Profile | Use case
test | Small smoke runs for a new machine.
balanced | Moderate sample sizes and runtime.
benchmark | Larger runs for stronger hardware.

Common task keys include:

Task key | Typical workload
text_classification | Text sequence classification.
token_classification | Named entity or token label tasks.
sentence_similarity | Pair scoring and similarity.
fill_mask | Masked language modelling.
text_generation | Causal language modelling.
text2text_generation | Summarisation and sequence-to-sequence generation.
image_classification | Image classification.
object_detection | Object detection.
image_segmentation | Segmentation.
image_captioning | Image-to-text generation.
text_image_retrieval | Image/text retrieval.
visual_question_answering | VQA.
tabular_regression | Generic tabular regression service rows.

Review The Manifest

Open outputs/service_manifest.xlsx before executing it.

Important columns:

Column | Purpose
enabled | Set to false to skip a row. Missing values default to enabled.
service_id | Primary service identifier. Missing values are generated deterministically.
case_name | Human-readable model/dataset/regime label.
dataset, dataset_name, dataset_config | Dataset source and provider identifiers.
model_type, hf_model_id, hf_task | Runner and model identifiers.
task_type, task, task_tag, modality | Functional compatibility attributes.
train_split, test_split, benchmark_split | Training and benchmark split names.
training_regime | finetune_transfer, inference_only, or generic.
resource_tier | Workload budget: light, medium, heavy, or stress_test.
training_epochs, batch_size, learning_rate, optimizer | Training and runtime knobs.
max_samples, max_length, timeout_s, max_train_time_s, max_eval_time_s, device | Workload and runtime controls.
input_schema, output_schema | Compatibility metadata for later composition work.

For first runs on a new computer, reduce risk by keeping max_samples low, using --manifest-profile test, and setting enabled=false for rows you do not want to run yet.

--resource-tier controls model, dataset, and knob selection. If omitted, it follows the profile: test -> light, balanced -> medium, and benchmark -> heavy. Use stress_test only when you intentionally want the largest allowed services.
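The fallback from profile to tier can be written as a small lookup. This is a sketch of the rule stated above, not the CLI's own code; `resolve_resource_tier` is a hypothetical helper name, while the profile and tier values are the ones listed in this README.

```python
from typing import Optional

# Mapping and tier names as documented above.
PROFILE_DEFAULT_TIER = {"test": "light", "balanced": "medium", "benchmark": "heavy"}
VALID_TIERS = ("light", "medium", "heavy", "stress_test")


def resolve_resource_tier(profile: str, tier: Optional[str] = None) -> str:
    """Return the explicit tier if given, otherwise the profile's default tier."""
    if tier:
        if tier not in VALID_TIERS:
            raise ValueError(f"unknown resource tier: {tier!r}")
        return tier
    try:
        return PROFILE_DEFAULT_TIER[profile]
    except KeyError:
        raise ValueError(f"unknown manifest profile: {profile!r}") from None


print(resolve_resource_tier("balanced"))             # medium
print(resolve_resource_tier("test", "stress_test"))  # stress_test
```

Note that stress_test is never chosen implicitly in this sketch; it has to be requested, which matches the advice to use it only on purpose.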

For GPU runs, leave device blank or set it to auto unless you need to force a device. PyTorch exposes ROCm devices through the torch.cuda API, so the runner will still resolve a supported AMD ROCm GPU as cuda.

On multi-GPU Linux machines, the runner pins each worker process to an individual GPU. With grouped HF execution enabled, different Hugging Face model groups can be scheduled onto different GPUs while still reusing prepared models and datasets inside each group.

For a 2-GPU NVIDIA Linux VM, use:

python -m mlaas_data_generator.cli.main run-manifest \
  --file outputs/service_manifest.xlsx \
  --sheet services \
  --db outputs/services.db \
  --workers 2

--workers 2 starts two GPU-pinned worker processes. On grouped HF runs, different HF groups can run on different GPUs. On row-local runs, independent rows can run on different GPUs.

When GPU-parallel worker processes are used, each GPU writes to its own SQLite database file to avoid concurrent writes into the same SQLite database. For example, --db outputs/services.db --workers 2 will produce files such as outputs/services.gpu0.db and outputs/services.gpu1.db.
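The naming pattern in that example can be derived from the --db value as sketched below. The helper `per_gpu_db_paths` is hypothetical, and two details are assumptions: that worker N maps to GPU index N, and that a single-worker run keeps the original path unchanged.

```python
from pathlib import Path


def per_gpu_db_paths(db: str, workers: int) -> list[Path]:
    """Derive one database path per GPU worker by inserting .gpuN before the suffix."""
    base = Path(db)
    if workers <= 1:
        return [base]  # assumed: no suffix when there is only one worker
    return [base.with_name(f"{base.stem}.gpu{i}{base.suffix}") for i in range(workers)]


for path in per_gpu_db_paths("outputs/services.db", 2):
    print(path.as_posix())
# outputs/services.gpu0.db
# outputs/services.gpu1.db
```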

During run-manifest, the CLI now shows a live manifest progress footer with completed rows out of total enabled rows plus current worker/GPU assignments. In multi-worker runs, progress is counted from completed manifest rows rather than submitted tasks, so grouped HF workers and parallel GPU processes still report accurate manifest-level completion.

Use --no-grouped-hf only when you want the most aggressive row-level parallelism and are willing to trade away grouped model/dataset reuse. Keeping grouped HF enabled is usually the better default when many rows share the same HF model.

Default values can be supplied in the manifest itself: CSV manifests can include a row with service_id=defaults, and XLSX manifests can include a defaults sheet.

Validate A Manifest

Dry-run validation does not train models. It normalizes column names, applies defaults, validates enabled rows, resolves missing service_id values, and writes outputs/service_manifest_results.csv.

python -m mlaas_data_generator.cli.main run-manifest \
  --file outputs/service_manifest.xlsx \
  --sheet services \
  --dry-run

If validation fails, check:

  • missing required columns such as dataset, model_type, or task_type
  • invalid training_regime
  • missing hf_model_id or hf_task for Hugging Face rows
  • stale sheet names if you changed --sheet

Run The Program

After the dry run succeeds, execute the enabled service rows:

python -m mlaas_data_generator.cli.main run-manifest \
  --file outputs/service_manifest.xlsx \
  --sheet services \
  --db outputs/services.db

For the 2-GPU Linux VM shown above, a good default is:

python -m mlaas_data_generator.cli.main run-manifest \
  --file outputs/service_manifest.xlsx \
  --sheet services \
  --db outputs/services.db \
  --workers 2

If you explicitly want to maximize row-level spreading instead of grouped reuse, use:

python -m mlaas_data_generator.cli.main run-manifest \
  --file outputs/service_manifest.xlsx \
  --sheet services \
  --db outputs/services.db \
  --workers 2 \
  --no-grouped-hf

To confirm both GPUs are active while the run is in progress:

watch -n 1 nvidia-smi

Or capture a compact view:

nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv -l 1

The run writes:

Path | Contents
outputs/services.db | SQLite database containing service records and metrics.
outputs/services.gpu0.db, outputs/services.gpu1.db, ... | Per-GPU SQLite databases created automatically during GPU-parallel runs.
outputs/service_manifest_results.csv | Per-row success/failure summary.
outputs/service_manifest_failed.csv | Retry manifest containing only failed row-level runs from the latest manifest execution.
outputs/service_failures.log | Detailed validation or runtime failures.

Successful rows are written to the SQLite database configured by CONFIG["db_path"], MLAAS_DB_PATH, MLAAS_SQL_DB_PATH, or the --db override.
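After a GPU-parallel run, the service records are spread across the per-GPU database files. One way to read them together is sketched below using only the standard library. `load_services` and `has_table` are hypothetical helpers; the sketch assumes the per-GPU files sit next to the --db path and follow the .gpuN naming shown in the table, and it makes no assumption about column names in the services table.

```python
import sqlite3
from pathlib import Path


def has_table(conn: sqlite3.Connection, name: str) -> bool:
    """Check sqlite_master so a file without the table is skipped, not an error."""
    query = "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?"
    return conn.execute(query, (name,)).fetchone() is not None


def load_services(db: str) -> list[dict]:
    """Collect services rows from db and from sibling <stem>.gpuN<suffix> files."""
    base = Path(db)
    candidates = [base, *sorted(base.parent.glob(f"{base.stem}.gpu*{base.suffix}"))]
    rows: list[dict] = []
    for path in candidates:
        if not path.is_file():
            continue  # do not let sqlite3.connect create an empty file
        conn = sqlite3.connect(path)
        conn.row_factory = sqlite3.Row
        try:
            if not has_table(conn, "services"):
                continue
            for record in conn.execute("SELECT * FROM services"):
                # Tag each row with the file it came from.
                rows.append({"source_db": path.name, **dict(record)})
        finally:
            conn.close()
    return rows
```

Calling `load_services("outputs/services.db")` returns a list of dictionaries, each with a `source_db` key, and returns an empty list when no matching database exists yet.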

Database Tables

The active schema is service-only:

Table | Contents
services | One row per manifest service instance.
service_metrics | Typed quality, QoS, latency, runtime, resource, cost, reliability, explainability, and metadata metrics.
service_artifacts | Optional model, report, or output artifact references.
service_split_provenance | Optional split and distribution provenance.
service_failures | Validation and execution failure details.

There are no active federated workflow or model-averaging tables.

Query Results

Use SQLite directly:

sqlite3 outputs/services.db ".tables"
sqlite3 outputs/services.db "select service_id, status, task_type, training_regime from services limit 10;"

Or load results in Python:

python - <<'PY'
import sqlite3
import pandas as pd

# Load the first ten service records into a DataFrame.
conn = sqlite3.connect("outputs/services.db")
df = pd.read_sql_query("select * from services limit 10", conn)
conn.close()
print(df)
PY
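Before writing queries against specific columns, it can help to see which tables a database actually contains and how many rows each one holds. The sketch below does that through SQLite's built-in sqlite_master catalog, so it does not depend on any column names; `table_row_counts` is a hypothetical helper, not part of this project's CLI.

```python
import sqlite3


def table_row_counts(db_path: str) -> dict[str, int]:
    """Map each user table in the database to its row count."""
    conn = sqlite3.connect(db_path)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )]
        # Table names come from the catalog itself, so quoting them is sufficient.
        return {name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
                for name in names}
    finally:
        conn.close()
```

For example, `table_row_counts("outputs/services.db")` should list the tables described under Database Tables once a run has completed. Note that sqlite3.connect creates an empty file if the path does not exist, so check the path first when that matters.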

Scaling Up On A More Powerful Machine

After the smoke run works:

  1. Increase --total-services.
  2. Move from --manifest-profile test / --resource-tier light to balanced / medium or benchmark / heavy.
  3. Add more --task-keys.
  4. Increase --models-per-task or --datasets-per-model.
  5. Increase max_samples in the manifest or use --avg-sample-size.

Example larger manifest:

python -m mlaas_data_generator.cli.main hf-manifest \
  --manifest-profile balanced \
  --resource-tier medium \
  --task-keys text_classification,token_classification,sentence_similarity,image_classification,object_detection \
  --models-per-task 8 \
  --datasets-per-model 2 \
  --training-regimes finetune_transfer,inference_only \
  --dataset-variants-per-pair 1 \
  --split-variants-per-pair 1 \
  --knob-variants-per-pair 2 \
  --total-services 40 \
  --output outputs/service_manifest_balanced.xlsx

Validate it:

python -m mlaas_data_generator.cli.main run-manifest \
  --file outputs/service_manifest_balanced.xlsx \
  --sheet services \
  --dry-run

Run it:

python -m mlaas_data_generator.cli.main run-manifest \
  --file outputs/service_manifest_balanced.xlsx \
  --sheet services \
  --db outputs/services_balanced.db

Tests

Install requirements first, then run:

python -m pytest mlaas_data_generator/test

Focused checks:

python -m pytest \
  mlaas_data_generator/test/test_service_manifest_pipeline.py \
  mlaas_data_generator/test/test_service_storage.py \
  mlaas_data_generator/test/test_service_runner.py

Troubleshooting

If torch.cuda.is_available() is False, check the installed PyTorch build first. On NVIDIA, verify the CUDA install you selected. On AMD Windows ROCm, verify that you used .\scripts\setup_windows_rocm_venv.ps1, that Python is 3.12, and that the installed wheel version matches AMD's current Windows ROCm support matrix.

If a Hugging Face dataset or model fails to download, check internet access, disk space, HF_HOME, HF_DATASETS_CACHE, and whether the model or dataset requires HF_TOKEN.

If Excel output fails, confirm openpyxl is installed in the active virtual environment:

python -m pip show openpyxl

If a run fails partway through, inspect:

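# Show the most recent entries in the failure log.
tail -n 80 outputs/service_failures.log

# Show the ten most recent failure records stored in the database.
python - <<'PY'
import sqlite3
import pandas as pd

conn = sqlite3.connect("outputs/services.db")
print(pd.read_sql_query("select * from service_failures order by failure_id desc limit 10", conn))
conn.close()
PY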

To rerun only failed rows after fixing the cause, use:

python -m mlaas_data_generator.cli.main run-manifest \
  --file outputs/service_manifest_failed.csv \
  --db outputs/services.db

Extending

  • Add HF models in mlaas_data_generator/registry/models.py.
  • Add HF datasets in mlaas_data_generator/registry/datasets.py.
  • Keep new execution behavior row-local: one manifest row produces one independent service record.
  • Add future composition logic in a separate layer that reads the service table; do not couple composition to service generation.

About

MLaaS Service Dataset Generator builds reviewed service manifests from Hugging Face model and dataset registries, executes each service row, evaluates model performance, captures runtime and resource metrics, and stores the results in SQLite for analysis.
