
Share Model and Dataset via Hugging Face #33

@hanneshapke

Description


Background

Hugging Face Hub is the premier platform for sharing machine learning models and datasets with the open-source community. By uploading the Yaak PII detection model and dataset to Hugging Face, we can:

  • Increase visibility - Make the model discoverable to researchers and practitioners
  • Enable easy usage - One-line model loading with transformers
  • Version control - Track model and dataset versions over time
  • Community collaboration - Allow others to fine-tune and improve
  • Standardization - Follow ML community best practices
  • Reproducibility - Provide transparent access to training data

This issue proposes creating automated scripts to upload:

  1. Signed model (from Issue Add Model Signing to ML Pipeline #30) to Hugging Face Model Hub
  2. Dataset + Dataset Card (from Issue Generate Dataset Statistics and Dataset Card #32) to Hugging Face Datasets

Implementation Plan

Phase 1: Setup Hugging Face Integration

File: pyproject.toml

Add Hugging Face Hub dependencies:

[project.optional-dependencies]
# Add to existing sections
huggingface = [
    "huggingface-hub>=0.20.0",
    "datasets>=2.0.0",
]

Set up authentication:

# Install huggingface_hub
pip install huggingface-hub

# Login (interactive - prompts for an access token)
huggingface-cli login

# Or set the token as an environment variable
# (recent huggingface_hub versions also read HF_TOKEN)
export HUGGING_FACE_HUB_TOKEN="hf_your_token_here"

Phase 2: Model Upload Script

File: src/scripts/upload_model_to_hf.py (new)

See the complete implementation in the repository at src/scripts/upload_model_to_hf.py

Key features:

  • Validates model directory and required files
  • Creates comprehensive model card with metadata
  • Uploads model to Hugging Face Hub
  • Includes model signature hash for verification
  • Supports both PyTorch (SafeTensors) and ONNX formats

Usage:

# Test with private repo
python src/scripts/upload_model_to_hf.py \
    --model-path model/trained \
    --repo-id your-username/yaak-pii-detector \
    --private

# Upload quantized ONNX model
python src/scripts/upload_model_to_hf.py \
    --model-path model/quantized \
    --repo-id your-username/yaak-pii-detector-onnx \
    --model-type onnx \
    --private
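The core of the script boils down to two steps: validate the local directory, then push it with `huggingface_hub`. A sketch under the assumption that the trained model ships `config.json` plus SafeTensors or ONNX weights (function names and the exact file list are illustrative, not the final implementation):

```python
from pathlib import Path

REQUIRED_FILES = {"config.json"}  # plus the weights file, chosen by model type


def validate_model_dir(model_path: str, model_type: str = "pytorch") -> Path:
    """Check that the model directory contains the files the Hub expects."""
    path = Path(model_path)
    if not path.is_dir():
        raise FileNotFoundError(f"Model directory not found: {path}")
    weights = "model.onnx" if model_type == "onnx" else "model.safetensors"
    missing = {*REQUIRED_FILES, weights} - {p.name for p in path.iterdir()}
    if missing:
        raise ValueError(f"Missing required files: {sorted(missing)}")
    return path


def upload_model(model_path: str, repo_id: str, model_type: str = "pytorch",
                 private: bool = True) -> None:
    """Create the repo (idempotent) and upload the validated folder."""
    # Imported lazily so the module can be tested without the optional dep.
    from huggingface_hub import HfApi

    path = validate_model_dir(model_path, model_type)
    api = HfApi()
    api.create_repo(repo_id, private=private, exist_ok=True)
    api.upload_folder(folder_path=str(path), repo_id=repo_id)
```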

Phase 3: Dataset Upload Script

File: src/scripts/upload_dataset_to_hf.py (new)

See the complete implementation in the repository at src/scripts/upload_dataset_to_hf.py

Key features:

  • Converts internal JSON format to Hugging Face Datasets format
  • Creates automatic train/validation/test splits (80/10/10)
  • Generates comprehensive dataset card
  • Validates dataset structure
  • Supports sample limiting for testing

Usage:

# Test with 1000 samples (private)
python src/scripts/upload_dataset_to_hf.py \
    --data-path model/dataset/reviewed_samples \
    --repo-id your-username/yaak-pii-dataset \
    --private \
    --limit 1000

# Full upload
python src/scripts/upload_dataset_to_hf.py \
    --data-path model/dataset/reviewed_samples \
    --repo-id yaak/pii-detection-dataset
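The 80/10/10 split can be produced with `datasets.Dataset.train_test_split`, or with a few lines of plain Python. A deterministic sketch (the function name `make_splits` is an assumption):

```python
import random


def make_splits(samples, seed: int = 42):
    """Shuffle deterministically and split 80/10/10 into train/val/test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.8)
    n_val = int(n * 0.1)
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }
```

Each split can then be wrapped with `datasets.Dataset.from_list` and pushed as a `DatasetDict` via `push_to_hub`.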

Phase 4: Update Makefile

File: Makefile

Add targets for Hugging Face uploads:

# Hugging Face upload targets
.PHONY: hf-login hf-upload-model-test hf-upload-dataset-test

hf-login:  ## Login to Hugging Face Hub
	@echo "Logging in to Hugging Face Hub..."
	huggingface-cli login

hf-upload-model-test:  ## Upload PyTorch model to HF (test with private repo)
	@echo "Testing model upload with private repository"
	@read -p "Enter your HF username: " username; \
	python3 src/scripts/upload_model_to_hf.py \
		--model-path model/trained \
		--repo-id $$username/yaak-pii-detector-test \
		--model-type pytorch \
		--private

hf-upload-dataset-test:  ## Upload dataset to HF (test with 1000 samples)
	@echo "Testing dataset upload with 1000 samples"
	@read -p "Enter your HF username: " username; \
	python3 src/scripts/upload_dataset_to_hf.py \
		--data-path model/dataset/reviewed_samples \
		--repo-id $$username/yaak-pii-dataset-test \
		--limit 1000 \
		--private

Testing Workflow

1. Initial Setup

# Install dependencies
pip install -e ".[huggingface]"

# Login to Hugging Face
make hf-login

2. Test Model Upload (Private)

# Test PyTorch model upload
make hf-upload-model-test

Verify:

  1. Visit https://huggingface.co/YOUR_USERNAME/yaak-pii-detector-test
  2. Check that all files are present
  3. Test loading:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "YOUR_USERNAME/yaak-pii-detector-test"
)

3. Test Dataset Upload (Private)

make hf-upload-dataset-test

Verify:

  1. Visit https://huggingface.co/datasets/YOUR_USERNAME/yaak-pii-dataset-test
  2. Check dataset card and viewer
  3. Test loading:
from datasets import load_dataset

dataset = load_dataset("YOUR_USERNAME/yaak-pii-dataset-test")
print(dataset['train'][0])

Model Card Structure

The generated model card includes:

  • Metadata: Language tags, license, datasets, metrics
  • Model Description: Architecture, PII types, training data
  • Usage Examples: PyTorch and ONNX inference code
  • Model Verification: SHA-256 hash (if model signing enabled)
  • Intended Use & Limitations: Clear guidance on appropriate usage
  • Bias Discussion: Known biases and mitigation strategies
  • Training Details: Hyperparameters, framework versions
  • Citation: BibTeX format for academic use
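On the Hub, the metadata above lives as YAML front matter at the top of the card's README.md. A sketch of what the generated header might look like (all values illustrative; the dataset repo-id in particular is an assumption):

```yaml
---
language:
  - en
license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
  - pii-detection
  - named-entity-recognition
datasets:
  - yaak/pii-detection-dataset
metrics:
  - f1
  - precision
  - recall
---
```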

Dataset Card Structure

The generated dataset card includes:

  • Metadata: Languages, license, task categories, size
  • Dataset Summary: Statistics for each split
  • Structure: Data fields, examples, splits
  • Creation: Curation process, annotation methods
  • Considerations: Social impact, biases, limitations
  • Usage Examples: Loading and training code
  • Citation: BibTeX format

Integration with Other Issues

Issue #30 (Model Signing)

The model upload script automatically includes the signature hash:

# Generate signature first
python -m model.src.sign_model model/trained

# Upload will include hash in model card
python src/scripts/upload_model_to_hf.py --model-path model/trained --repo-id yaak/pii-detector
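Embedding the hash in the model card needs nothing beyond the standard library. A sketch of the hashing step (the function name is illustrative; how Issue #30 stores its signature is up to that implementation):

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large weight files never load into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```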

Issue #32 (Dataset Card)

The dataset statistics can inform the upload:

# Generate statistics first
make dataset-stats

# Use insights to update dataset card
python src/scripts/upload_dataset_to_hf.py --data-path model/dataset/reviewed_samples --repo-id yaak/pii-dataset

Success Criteria

  • Hugging Face dependencies added to pyproject.toml
  • Model upload script created with:
    • Model validation
    • Automatic model card generation
    • PyTorch and ONNX support
    • Model signature integration
    • Private/public repository options
  • Dataset upload script created with:
    • Automatic train/val/test splits
    • HF Datasets format conversion
    • Dataset card generation
    • Sample limiting for testing
  • Makefile targets added
  • Successfully test upload to private repository
  • Model loadable via transformers
  • Dataset loadable via datasets
  • Documentation updated

Script Implementation Details

Both scripts (upload_model_to_hf.py and upload_dataset_to_hf.py) should include:

  1. ModelUploader/DatasetUploader classes:

    • Token management and authentication
    • Validation of input data
    • Card generation from templates
    • Upload with progress tracking
  2. Comprehensive model/dataset cards following Hugging Face standards

  3. Error handling with helpful messages

  4. CLI interface with argparse

The full implementation is approximately 400-500 lines per script.
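The CLI shape for the model script could look like the sketch below (flag names match the usage examples above; `ModelUploader` wiring is indicated but not implemented here):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Argument parser matching the usage examples for upload_model_to_hf.py."""
    parser = argparse.ArgumentParser(
        description="Upload the Yaak PII detection model to the Hugging Face Hub"
    )
    parser.add_argument("--model-path", required=True, help="Local model directory")
    parser.add_argument("--repo-id", required=True, help="e.g. yaak/pii-detector")
    parser.add_argument("--model-type", choices=["pytorch", "onnx"], default="pytorch")
    parser.add_argument("--private", action="store_true", help="Create a private repo")
    return parser


def main(argv=None) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    # A ModelUploader class (as proposed above) would be instantiated and
    # driven from here; its constructor signature is left to the implementation.
    return args
```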


Future Enhancements

  1. Automated uploads: CI/CD pipeline to auto-upload on releases
  2. Version tagging: Semantic versioning for models and datasets
  3. Model zoo: Upload multiple model variants (base, large, quantized)
  4. Benchmark integration: Automatic evaluation on Hugging Face Spaces
  5. Community contributions: Accept fine-tuned model submissions

Notes

This is marked as a "good first issue" because:
