
Share Model and Dataset via Hugging Face #33

@hanneshapke

Description


Background

Hugging Face Hub is the premier platform for sharing machine learning models and datasets with the open-source community. By uploading the Yaak PII detection model and dataset to Hugging Face, we can:

  • Increase visibility - Make the model discoverable to researchers and practitioners
  • Enable easy usage - One-line model loading with transformers
  • Version control - Track model and dataset versions over time
  • Community collaboration - Allow others to fine-tune and improve
  • Standardization - Follow ML community best practices
  • Reproducibility - Provide transparent access to training data

This issue proposes creating automated scripts to upload:

  1. Signed model (from Issue Add Model Signing to ML Pipeline #30) to Hugging Face Model Hub
  2. Dataset + Dataset Card (from Issue Generate Dataset Statistics and Dataset Card #32) to Hugging Face Datasets

Implementation Plan

Phase 1: Setup Hugging Face Integration

File: pyproject.toml

Add Hugging Face Hub dependencies:

[project.optional-dependencies]
# Add to existing sections
huggingface = [
    "huggingface-hub>=0.20.0",
    "datasets>=2.0.0",
]

Set up authentication:

# Install huggingface_hub
pip install huggingface-hub

# Login (interactive - prompts for an access token)
huggingface-cli login

# Or set the token as an environment variable
# (recent huggingface_hub versions also read HF_TOKEN)
export HUGGING_FACE_HUB_TOKEN="hf_your_token_here"

Phase 2: Model Upload Script

File: src/scripts/upload_model_to_hf.py (new)

See the complete implementation in the repository at src/scripts/upload_model_to_hf.py

Key features:

  • Validates model directory and required files
  • Creates comprehensive model card with metadata
  • Uploads model to Hugging Face Hub
  • Includes model signature hash for verification
  • Supports both PyTorch (SafeTensors) and ONNX formats

Usage:

# Test with private repo
python src/scripts/upload_model_to_hf.py \
    --model-path model/trained \
    --repo-id your-username/yaak-pii-detector \
    --private

# Upload quantized ONNX model
python src/scripts/upload_model_to_hf.py \
    --model-path model/quantized \
    --repo-id your-username/yaak-pii-detector-onnx \
    --model-type onnx \
    --private
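The core of the script boils down to two steps: validate the local directory, then push it with `huggingface_hub`. A sketch under the assumption that the trained model ships `config.json` plus SafeTensors or ONNX weights (function names and the exact file list are illustrative, not the final implementation):

```python
from pathlib import Path

REQUIRED_FILES = {"config.json"}  # plus the weights file, chosen by model type


def validate_model_dir(model_path: str, model_type: str = "pytorch") -> Path:
    """Check that the model directory contains the files the Hub expects."""
    path = Path(model_path)
    if not path.is_dir():
        raise FileNotFoundError(f"Model directory not found: {path}")
    weights = "model.onnx" if model_type == "onnx" else "model.safetensors"
    missing = {*REQUIRED_FILES, weights} - {p.name for p in path.iterdir()}
    if missing:
        raise ValueError(f"Missing required files: {sorted(missing)}")
    return path


def upload_model(model_path: str, repo_id: str, model_type: str = "pytorch",
                 private: bool = True) -> None:
    """Create the repo (idempotent) and upload the validated folder."""
    # Imported lazily so the module can be tested without the optional dep.
    from huggingface_hub import HfApi

    path = validate_model_dir(model_path, model_type)
    api = HfApi()
    api.create_repo(repo_id, private=private, exist_ok=True)
    api.upload_folder(folder_path=str(path), repo_id=repo_id)
```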

Phase 3: Dataset Upload Script

File: src/scripts/upload_dataset_to_hf.py (new)

See the complete implementation in the repository at src/scripts/upload_dataset_to_hf.py

Key features:

  • Converts internal JSON format to Hugging Face Datasets format
  • Creates automatic train/validation/test splits (80/10/10)
  • Generates comprehensive dataset card
  • Validates dataset structure
  • Supports sample limiting for testing

Usage:

# Test with 1000 samples (private)
python src/scripts/upload_dataset_to_hf.py \
    --data-path model/dataset/reviewed_samples \
    --repo-id your-username/yaak-pii-dataset \
    --private \
    --limit 1000

# Full upload
python src/scripts/upload_dataset_to_hf.py \
    --data-path model/dataset/reviewed_samples \
    --repo-id yaak/pii-detection-dataset
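The 80/10/10 split can be produced with `datasets.Dataset.train_test_split`, or with a few lines of plain Python. A deterministic sketch (the function name `make_splits` is an assumption):

```python
import random


def make_splits(samples, seed: int = 42):
    """Shuffle deterministically and split 80/10/10 into train/val/test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.8)
    n_val = int(n * 0.1)
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }
```

Each split can then be wrapped with `datasets.Dataset.from_list` and pushed as a `DatasetDict` via `push_to_hub`.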

Phase 4: Update Makefile

File: Makefile

Add targets for Hugging Face uploads:

# Hugging Face upload targets
.PHONY: hf-login hf-upload-model-test hf-upload-dataset-test

hf-login:  ## Login to Hugging Face Hub
	@echo "Logging in to Hugging Face Hub..."
	huggingface-cli login

hf-upload-model-test:  ## Upload PyTorch model to HF (test with private repo)
	@echo "Testing model upload with private repository"
	@read -p "Enter your HF username: " username; \
	python3 src/scripts/upload_model_to_hf.py \
		--model-path model/trained \
		--repo-id $$username/yaak-pii-detector-test \
		--model-type pytorch \
		--private

hf-upload-dataset-test:  ## Upload dataset to HF (test with 1000 samples)
	@echo "Testing dataset upload with 1000 samples"
	@read -p "Enter your HF username: " username; \
	python3 src/scripts/upload_dataset_to_hf.py \
		--data-path model/dataset/reviewed_samples \
		--repo-id $$username/yaak-pii-dataset-test \
		--limit 1000 \
		--private

Testing Workflow

1. Initial Setup

# Install dependencies
pip install -e ".[huggingface]"

# Login to Hugging Face
make hf-login

2. Test Model Upload (Private)

# Test PyTorch model upload
make hf-upload-model-test

Verify:

  1. Visit https://huggingface.co/YOUR_USERNAME/yaak-pii-detector-test
  2. Check that all files are present
  3. Test loading:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "YOUR_USERNAME/yaak-pii-detector-test"
)

3. Test Dataset Upload (Private)

make hf-upload-dataset-test

Verify:

  1. Visit https://huggingface.co/datasets/YOUR_USERNAME/yaak-pii-dataset-test
  2. Check dataset card and viewer
  3. Test loading:
from datasets import load_dataset

dataset = load_dataset("YOUR_USERNAME/yaak-pii-dataset-test")
print(dataset['train'][0])

Model Card Structure

The generated model card includes:

  • Metadata: Language tags, license, datasets, metrics
  • Model Description: Architecture, PII types, training data
  • Usage Examples: PyTorch and ONNX inference code
  • Model Verification: SHA-256 hash (if model signing enabled)
  • Intended Use & Limitations: Clear guidance on appropriate usage
  • Bias Discussion: Known biases and mitigation strategies
  • Training Details: Hyperparameters, framework versions
  • Citation: BibTeX format for academic use
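On the Hub, the metadata above lives as YAML front matter at the top of the card's README.md. A sketch of what the generated header might look like (all values illustrative; the dataset repo-id in particular is an assumption):

```yaml
---
language:
  - en
license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
  - pii-detection
  - named-entity-recognition
datasets:
  - yaak/pii-detection-dataset
metrics:
  - f1
  - precision
  - recall
---
```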

Dataset Card Structure

The generated dataset card includes:

  • Metadata: Languages, license, task categories, size
  • Dataset Summary: Statistics for each split
  • Structure: Data fields, examples, splits
  • Creation: Curation process, annotation methods
  • Considerations: Social impact, biases, limitations
  • Usage Examples: Loading and training code
  • Citation: BibTeX format

Integration with Other Issues

Issue #30 (Model Signing)

The model upload script automatically includes the signature hash:

# Generate signature first
python -m model.src.sign_model model/trained

# Upload will include hash in model card
python src/scripts/upload_model_to_hf.py --model-path model/trained --repo-id yaak/pii-detector
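Embedding the hash in the model card needs nothing beyond the standard library. A sketch of the hashing step (the function name is illustrative; how Issue #30 stores its signature is up to that implementation):

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large weight files never load into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```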

Issue #32 (Dataset Card)

The dataset statistics can inform the upload:

# Generate statistics first
make dataset-stats

# Use insights to update dataset card
python src/scripts/upload_dataset_to_hf.py --data-path model/dataset/reviewed_samples --repo-id yaak/pii-dataset

Success Criteria

  • Hugging Face dependencies added to pyproject.toml
  • Model upload script created with:
    • Model validation
    • Automatic model card generation
    • PyTorch and ONNX support
    • Model signature integration
    • Private/public repository options
  • Dataset upload script created with:
    • Automatic train/val/test splits
    • HF Datasets format conversion
    • Dataset card generation
    • Sample limiting for testing
  • Makefile targets added
  • Successfully test upload to private repository
  • Model loadable via transformers
  • Dataset loadable via datasets
  • Documentation updated

Script Implementation Details

Both scripts (upload_model_to_hf.py and upload_dataset_to_hf.py) should include:

  1. ModelUploader/DatasetUploader classes:

    • Token management and authentication
    • Validation of input data
    • Card generation from templates
    • Upload with progress tracking
  2. Comprehensive model/dataset cards following Hugging Face standards

  3. Error handling with helpful messages

  4. CLI interface with argparse

The full implementation is approximately 400-500 lines per script.
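The CLI shape for the model script could look like the sketch below (flag names match the usage examples above; `ModelUploader` wiring is indicated but not implemented here):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Argument parser matching the usage examples for upload_model_to_hf.py."""
    parser = argparse.ArgumentParser(
        description="Upload the Yaak PII detection model to the Hugging Face Hub"
    )
    parser.add_argument("--model-path", required=True, help="Local model directory")
    parser.add_argument("--repo-id", required=True, help="e.g. yaak/pii-detector")
    parser.add_argument("--model-type", choices=["pytorch", "onnx"], default="pytorch")
    parser.add_argument("--private", action="store_true", help="Create a private repo")
    return parser


def main(argv=None) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    # A ModelUploader class (as proposed above) would be instantiated and
    # driven from here; its constructor signature is left to the implementation.
    return args
```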


Future Enhancements

  1. Automated uploads: CI/CD pipeline to auto-upload on releases
  2. Version tagging: Semantic versioning for models and datasets
  3. Model zoo: Upload multiple model variants (base, large, quantized)
  4. Benchmark integration: Automatic evaluation on Hugging Face Spaces
  5. Community contributions: Accept fine-tuned model submissions

Notes

This is marked as a "good first issue" because:
