Share Model and Dataset via Hugging Face

Background
Hugging Face Hub is the premier platform for sharing machine learning models and datasets with the open-source community. By uploading the Yaak PII detection model and dataset to Hugging Face, we can:
✅ Increase visibility - Make the model discoverable to researchers and practitioners
✅ Enable easy usage - One-line model loading with transformers
✅ Version control - Track model and dataset versions over time
✅ Community collaboration - Allow others to fine-tune and improve
✅ Standardization - Follow ML community best practices
✅ Reproducibility - Provide transparent access to training data
This issue proposes creating automated scripts to upload both the model and the dataset.
Implementation Plan
Phase 1: Setup Hugging Face Integration
File: pyproject.toml
Add the Hugging Face Hub dependencies, then set up authentication:
# Install huggingface_hub
pip install huggingface-hub
# Login (interactive - prompts for an access token)
huggingface-cli login

# Or set the token as an environment variable
export HUGGING_FACE_HUB_TOKEN="hf_your_token_here"
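The upload scripts need this token at runtime. A minimal sketch of how they might resolve it from the environment and fail with a helpful message (`resolve_token` is a hypothetical helper, not an existing function in the repository):

```python
import os

def resolve_token() -> str:
    """Return the Hugging Face access token from the environment.

    Hypothetical helper: the real scripts may also fall back to the
    token cached locally by `huggingface-cli login`.
    """
    token = os.environ.get("HUGGING_FACE_HUB_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "No Hugging Face token found. Run `huggingface-cli login` "
            "or set HUGGING_FACE_HUB_TOKEN."
        )
    return token
```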
Phase 2: Model Upload Script
File: src/scripts/upload_model_to_hf.py (new)
See the complete implementation in the repository at src/scripts/upload_model_to_hf.py
Key features:
Validates model directory and required files
Creates comprehensive model card with metadata
Uploads model to Hugging Face Hub
Includes model signature hash for verification
Supports both PyTorch (SafeTensors) and ONNX formats
Phase 3: Dataset Upload Script
File: src/scripts/upload_dataset_to_hf.py (new)
See the complete implementation in the repository at src/scripts/upload_dataset_to_hf.py
Phase 4: Update Makefile
File: Makefile
Add Makefile targets for the Hugging Face uploads.
Testing Workflow
1. Initial Setup
Install the dependencies and authenticate as described in Phase 1.
2. Test Model Upload (Private)
# Test PyTorch model upload
make hf-upload-model-test
Verify at: https://huggingface.co/YOUR_USERNAME/yaak-pii-detector-test
3. Test Dataset Upload (Private)
Verify at: https://huggingface.co/datasets/YOUR_USERNAME/yaak-pii-dataset-test
Model Card Structure
The generated model card includes the model metadata and signature hash described above.
Dataset Card Structure
The generated dataset card includes the dataset statistics and split information.
Integration with Other Issues
Issue #30 (Model Signing)
The model upload script automatically includes the signature hash:
# Generate signature first
python -m model.src.sign_model model/trained
# Upload will include hash in model card
python src/scripts/upload_model_to_hf.py --model-path model/trained --repo-id yaak/pii-detector
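The actual `model.src.sign_model` implementation is not shown in this issue; the sketch below illustrates one way such a signature hash could be computed over the model directory so the model card can embed it for verification (the hashing scheme is an assumption):

```python
import hashlib
from pathlib import Path

def model_signature(model_dir: str) -> str:
    """Sketch of a model signature: SHA-256 over all files in the
    model directory. Illustrative only - the real sign_model script
    may hash a different file set or use a different scheme.
    """
    digest = hashlib.sha256()
    root = Path(model_dir)
    # Sort paths so the hash is deterministic across filesystems.
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest.update(path.relative_to(root).as_posix().encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()
```

Hashing both the relative path and the bytes means renaming or editing any file changes the signature.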
Issue #32 (Dataset Card)
The dataset statistics can inform the upload:
# Generate statistics first
make dataset-stats
# Use insights to update dataset card
python src/scripts/upload_dataset_to_hf.py --data-path model/dataset/reviewed_samples --repo-id yaak/pii-dataset
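A sketch of the kind of statistics `make dataset-stats` might surface for the dataset card. It assumes, hypothetically, that reviewed samples are JSON files with an `entities` list whose items carry a `label` field; the real sample schema may differ:

```python
import json
from collections import Counter
from pathlib import Path

def label_counts(samples_dir: str) -> Counter:
    """Count entity labels across reviewed sample files.

    Assumed schema (illustrative): each *.json file holds one sample
    with an "entities" list of {"label": ...} dicts.
    """
    counts: Counter = Counter()
    for path in Path(samples_dir).glob("*.json"):
        sample = json.loads(path.read_text())
        for entity in sample.get("entities", []):
            counts[entity["label"]] += 1
    return counts
```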
Success Criteria
Hugging Face dependencies added to pyproject.toml
Model upload script created with:
Model validation
Automatic model card generation
PyTorch and ONNX support
Model signature integration
Private/public repository options
Dataset upload script created with:
Automatic train/val/test splits
HF Datasets format conversion
Dataset card generation
Sample limiting for testing
Makefile targets added
Successfully test upload to private repository
Model loadable via transformers
Dataset loadable via datasets
Documentation updated
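The checklist calls for automatic train/val/test splits. A minimal deterministic sketch; the actual script's split strategy and ratios are not specified in this issue, so the ones here are illustrative:

```python
import random

def split_samples(samples, val_frac=0.1, test_frac=0.1, seed=42):
    """Deterministically shuffle and split samples into train/val/test.

    Sketch only: ratios and the fixed seed are assumptions, chosen so
    repeated uploads produce identical splits.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    return {
        "train": items[n_test + n_val:],
        "validation": items[n_test:n_test + n_val],
        "test": items[:n_test],
    }
```

Seeding a private `random.Random` instance keeps the split reproducible without disturbing the global random state.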
Script Implementation Details
Both scripts (upload_model_to_hf.py and upload_dataset_to_hf.py) should include:
ModelUploader/DatasetUploader classes:
Token management and authentication
Validation of input data
Card generation from templates
Upload with progress tracking
Comprehensive model/dataset cards following Hugging Face standards
Error handling with helpful messages
CLI interface with argparse
The full implementation is approximately 400-500 lines per script.
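A sketch of the argparse interface and input validation such a script might expose. `--model-path` and `--repo-id` appear in this issue's usage examples; `--private` and the required-file list are assumptions based on the feature list above:

```python
import argparse
from pathlib import Path

def build_parser() -> argparse.ArgumentParser:
    """CLI sketch for upload_model_to_hf.py."""
    parser = argparse.ArgumentParser(
        description="Upload a model to the Hugging Face Hub")
    parser.add_argument("--model-path", required=True,
                        help="Directory containing the trained model files")
    parser.add_argument("--repo-id", required=True,
                        help="Target repository, e.g. yaak/pii-detector")
    parser.add_argument("--private", action="store_true",
                        help="Create the repository as private (assumed flag)")
    return parser

def validate_model_dir(model_dir: str) -> None:
    """Fail fast with a helpful message if required files are missing.

    The required-file list here is illustrative, not the script's
    actual one.
    """
    root = Path(model_dir)
    missing = [f for f in ("config.json",) if not (root / f).is_file()]
    if missing:
        raise FileNotFoundError(
            f"{model_dir} is missing required files: {missing}")
```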
Future Enhancements
Automated uploads: CI/CD pipeline to auto-upload on releases
Version tagging: Semantic versioning for models and datasets
Model zoo: Upload multiple model variants (base, large, quantized)
Benchmark integration: Automatic evaluation on Hugging Face Spaces
Community contributions: Accept fine-tuned model submissions
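For the version-tagging enhancement, a tiny helper sketch for bumping semantic version tags; the actual tagging scheme is not decided in this issue, so the `vMAJOR.MINOR.PATCH` format is an assumption:

```python
def bump_version(tag: str, part: str = "patch") -> str:
    """Bump a semantic version tag like 'v1.2.3'.

    Illustrative helper for the version-tagging enhancement only.
    """
    major, minor, patch = (int(x) for x in tag.lstrip("v").split("."))
    if part == "major":
        major, minor, patch = major + 1, 0, 0
    elif part == "minor":
        minor, patch = minor + 1, 0
    else:
        patch += 1
    return f"v{major}.{minor}.{patch}"
```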
Notes
This is marked as a "good first issue" because: