Background
The project currently maintains three separate dataset directories with different stages of data processing:
Dataset Directory Structure
model/dataset/samples/ (~20,000 files): Raw samples generated by LLM
model/dataset/reviewed_samples/ (~20,000 files): Samples reviewed and corrected by LLM
model/dataset/training_samples/ (~6,000 files): Final processed samples used for model training
Current Data Format
Each JSON file contains:
{
  "text": "Fatima Khaled resides at 2114 Cedar Crescent in Marseille...",
  "privacy_mask": [
    {"value": "Fatima", "label": "FIRSTNAME"},
    {"value": "Khaled", "label": "SURNAME"},
    {"value": "2114", "label": "BUILDINGNUM"}
  ],
  "coreferences": [
    {
      "cluster_id": 0,
      "mentions": ["Fatima Khaled", "Her", "she"],
      "entity_type": "person"
    }
  ]
}
Key characteristics:
- Labels are direct (e.g., FIRSTNAME, SURNAME) without BIO prefixes (no B-/I- notation)
- Co-references track entity mentions across the text
- PII labels include: SURNAME, FIRSTNAME, EMAIL, PHONENUMBER, ADDRESS, CITY, SSN, DRIVER_LICENSE, etc.
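A minimal loader that enforces this schema could look like the sketch below (the function name and path handling are illustrative, not part of the existing codebase):

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"text", "privacy_mask", "coreferences"}

def load_sample(path: Path) -> dict:
    """Load one dataset file and verify it has the expected top-level keys."""
    sample = json.loads(path.read_text(encoding="utf-8"))
    missing = REQUIRED_KEYS - sample.keys()
    if missing:
        raise ValueError(f"{path.name} is missing keys: {sorted(missing)}")
    return sample
```

A check like this catches malformed files before they reach conversion or training.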
Challenge
Currently, there's no systematic human review process for validating:
- NER extraction accuracy (are PII entities correctly identified?)
- Label correctness (is "Fatima" correctly labeled as FIRSTNAME?)
- Co-reference resolution quality (are mentions properly grouped?)
This issue proposes integrating LabelStudio for efficient human review and quality assurance of the dataset.
Implementation Plan
Phase 1: Setup LabelStudio with Docker
File: docker-compose.labelstudio.yml (new)
Create a dedicated Docker Compose file for LabelStudio:
version: '3.8'

services:
  labelstudio:
    image: heartexlabs/label-studio:latest
    container_name: yaak-labelstudio
    ports:
      - "8081:8080"
    volumes:
      # Data storage for LabelStudio projects
      - ./data/labelstudio:/label-studio/data
      # Mount dataset directories (read-only for safety)
      - ./model/dataset/samples:/datasets/samples:ro
      - ./model/dataset/reviewed_samples:/datasets/reviewed_samples:ro
      - ./model/dataset/training_samples:/datasets/training_samples:ro
      # Export directory for reviewed annotations
      - ./data/labelstudio_exports:/exports
    environment:
      - LABEL_STUDIO_LOCAL_FILES_SERVING_ENABLED=true
      - LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT=/datasets
      # For production, set strong credentials
      - LABEL_STUDIO_USERNAME=admin
      - LABEL_STUDIO_PASSWORD=changeme123
    restart: unless-stopped
    networks:
      - yaak-network

networks:
  yaak-network:
    driver: bridge
Usage:
# Start LabelStudio
docker-compose -f docker-compose.labelstudio.yml up -d
# Access at http://localhost:8081
# Login with admin/changeme123 (change in production!)
# Stop LabelStudio
docker-compose -f docker-compose.labelstudio.yml down
Phase 2: Create Custom LabelStudio Template
File: data/labelstudio/templates/pii_ner_coref_template.xml (new)
Create a custom annotation template that supports:
- NER annotation without B-/I- prefixes
- Co-reference review for entity mentions
<View>
  <Header value="PII Entity and Co-reference Annotation"/>
  <Text name="text" value="$text"/>

  <!-- NER Annotation Section -->
  <View style="box-shadow: 2px 2px 5px #999; padding: 20px; margin-top: 2em; border-radius: 5px;">
    <Header value="Named Entity Recognition (PII)" size="4"/>
    <Labels name="label" toName="text">
      <!-- Person Information -->
      <Label value="FIRSTNAME" background="#FFA39E"/>
      <Label value="SURNAME" background="#FF7875"/>
      <!-- Contact Information -->
      <Label value="EMAIL" background="#FFD666"/>
      <Label value="PHONENUMBER" background="#FFC53D"/>
      <Label value="URL" background="#FFB340"/>
      <!-- Location Information -->
      <Label value="BUILDINGNUM" background="#95DE64"/>
      <Label value="STREET" background="#73D13D"/>
      <Label value="CITY" background="#52C41A"/>
      <Label value="STATE" background="#389E0D"/>
      <Label value="ZIP" background="#237804"/>
      <Label value="COUNTRY" background="#135200"/>
      <!-- Identification Numbers -->
      <Label value="SSN" background="#FF85C0"/>
      <Label value="DRIVER_LICENSE" background="#F759AB"/>
      <Label value="PASSPORT" background="#EB2F96"/>
      <Label value="NATIONAL_ID" background="#C41D7F"/>
      <Label value="IDCARDNUM" background="#9E1068"/>
      <Label value="TAXNUM" background="#780650"/>
      <Label value="LICENSEPLATENUM" background="#520339"/>
      <!-- Financial Information -->
      <Label value="IBAN" background="#91CAFF"/>
      <!-- Company Information -->
      <Label value="COMPANYNAME" background="#BAE0FF"/>
      <!-- Other -->
      <Label value="DATEOFBIRTH" background="#D3ADF7"/>
      <Label value="DOB" background="#B37FEB"/>
      <Label value="AGE" background="#9254DE"/>
      <Label value="PASSWORD" background="#FF4D4F"/>
    </Labels>
  </View>

  <!-- Co-reference Annotation Section -->
  <View style="box-shadow: 2px 2px 5px #999; padding: 20px; margin-top: 2em; border-radius: 5px;">
    <Header value="Co-reference Resolution" size="4"/>
    <Relations name="coreference" toName="text">
      <Relation value="COREF_PERSON" background="#69c0ff"/>
      <Relation value="COREF_ORGANIZATION" background="#95de64"/>
      <Relation value="COREF_LOCATION" background="#ffc069"/>
      <Relation value="COREF_OTHER" background="#d3adf7"/>
    </Relations>
  </View>

  <!-- Pre-annotated Data Display -->
  <View style="box-shadow: 2px 2px 5px #999; padding: 20px; margin-top: 2em; border-radius: 5px; background: #f5f5f5;">
    <Header value="Pre-annotated Data (Reference)" size="4"/>
    <Text name="privacy_mask_display" value="$privacy_mask_str"/>
    <Text name="coreference_display" value="$coreferences_str"/>
  </View>

  <!-- Review Notes -->
  <View style="margin-top: 2em;">
    <Header value="Review Notes" size="4"/>
    <TextArea name="notes" toName="text"
              placeholder="Add any notes about annotation quality, issues, or corrections..."
              rows="3"/>
    <Choices name="quality" toName="text" choice="single" showInline="true">
      <Choice value="Excellent"/>
      <Choice value="Good"/>
      <Choice value="Fair"/>
      <Choice value="Poor"/>
    </Choices>
  </View>
</View>
Template Features:
- ✅ Direct label annotation (no B-/I- prefixes needed)
- ✅ Color-coded by category (Person, Contact, Location, ID, Financial)
- ✅ Co-reference relations support
- ✅ Display pre-annotated data for reference
- ✅ Quality rating and notes for each sample
Phase 3: Recommended Storage Structure
Create a new unified storage structure that works for both LabelStudio and model training:
model/
├── dataset/
│   ├── raw/              # Raw generated samples (existing: samples/)
│   ├── reviewed/         # LLM-reviewed samples (existing: reviewed_samples/)
│   ├── training/         # Final training data (existing: training_samples/)
│   └── labelstudio/      # NEW: LabelStudio-specific data
│       ├── import/       # Converted data ready for import
│       ├── export/       # Human-reviewed exports from LabelStudio
│       └── verified/     # Final verified dataset

data/                     # NEW: Root-level data directory
├── labelstudio/
│   ├── projects/         # LabelStudio project files
│   ├── templates/        # Annotation templates
│   │   └── pii_ner_coref_template.xml
│   └── config/           # LabelStudio configuration
└── labelstudio_exports/  # Export destination
Rationale:
- Separation of concerns: Keep LabelStudio data separate from model data
- Docker-friendly: Easy to mount specific directories
- Version control: Can gitignore large data files while keeping configs
- Workflow clarity: Clear path from raw → reviewed → training → verified
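To follow through on the version-control point, the repository's .gitignore could exclude the bulky, machine-generated directories while keeping templates and config tracked. A sketch under the structure proposed above (adjust paths as the layout evolves):

```gitignore
# LabelStudio runtime data and converted tasks (large, regenerable)
data/labelstudio/projects/
data/labelstudio_exports/
model/dataset/labelstudio/import/
model/dataset/labelstudio/export/

# Templates and config stay under version control (not listed above, so not ignored)
```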
Phase 4: Dataset Structure Changes
Proposed changes for LabelStudio compatibility:
- Add task IDs: Each sample needs a unique ID for LabelStudio tracking
- Flatten structure: Convert nested JSON to LabelStudio's expected format
- Pre-annotation support: Include existing labels as predictions
- Metadata: Add creation timestamp, source, version
New format:
{
  "id": "20251124103832_fb0dd1a3",
  "data": {
    "text": "Fatima Khaled resides at 2114 Cedar Crescent...",
    "privacy_mask_str": "FIRSTNAME: Fatima | SURNAME: Khaled | BUILDINGNUM: 2114",
    "coreferences_str": "Cluster 0 (person): Fatima Khaled, Her, she"
  },
  "predictions": [{
    "model_version": "llm_generated_v1",
    "result": [
      {
        "value": {
          "start": 0,
          "end": 6,
          "text": "Fatima",
          "labels": ["FIRSTNAME"]
        },
        "from_name": "label",
        "to_name": "text",
        "type": "labels"
      }
    ]
  }],
  "annotations": [],
  "meta": {
    "created_at": "2025-11-24T10:38:32Z",
    "source": "llm_generation",
    "version": "1.0"
  }
}
Phase 5: Conversion Script
File: src/scripts/convert_to_labelstudio.py (new)
Create a conversion script to transform existing dataset format to LabelStudio format:
#!/usr/bin/env python3
"""
Convert Yaak PII dataset to LabelStudio import format.

Usage:
    python src/scripts/convert_to_labelstudio.py \
        --input model/dataset/reviewed_samples \
        --output model/dataset/labelstudio/import \
        --limit 100

This script converts the internal JSON format to LabelStudio's task format,
including character-level span annotations for NER and co-reference data.
"""
import argparse
import json
import re
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List, Tuple


def utc_timestamp() -> str:
    """Current UTC time in ISO 8601 format with a Z suffix."""
    return datetime.now(timezone.utc).isoformat(timespec="seconds").replace("+00:00", "Z")


def find_entity_spans(text: str, entity_value: str) -> List[Tuple[int, int]]:
    """
    Find all occurrences of entity_value in text and return (start, end) positions.

    Args:
        text: The full text to search in
        entity_value: The entity string to find

    Returns:
        List of (start, end) tuples for each occurrence
    """
    # Escape the value so it is matched literally. Prefer word-boundary matches
    # to avoid tagging substrings of longer tokens; fall back to plain substring
    # matching for values that begin or end with non-word characters.
    escaped = re.escape(entity_value)
    matches = list(re.finditer(rf"\b{escaped}\b", text))
    if not matches:
        matches = list(re.finditer(escaped, text))
    return [(m.start(), m.end()) for m in matches]


def convert_privacy_mask_to_spans(text: str, privacy_mask: List[Dict]) -> List[Dict]:
    """
    Convert privacy_mask entities to LabelStudio span annotations.

    Args:
        text: The full text
        privacy_mask: List of {"value": str, "label": str}

    Returns:
        List of LabelStudio annotation results
    """
    results = []
    seen_positions = set()  # Track already annotated positions to avoid duplicates

    for entity in privacy_mask:
        value = entity["value"]
        label = entity["label"]

        # Find all occurrences of this entity in the text
        for start, end in find_entity_spans(text, value):
            # Skip if this position is already annotated (handles duplicates)
            position_key = (start, end)
            if position_key in seen_positions:
                continue
            seen_positions.add(position_key)

            results.append({
                "value": {
                    "start": start,
                    "end": end,
                    "text": value,
                    "labels": [label]
                },
                "from_name": "label",
                "to_name": "text",
                "type": "labels"
            })

    return results


def format_privacy_mask_display(privacy_mask: List[Dict]) -> str:
    """Format privacy_mask for human-readable display."""
    items = [f"{item['label']}: {item['value']}" for item in privacy_mask]
    return " | ".join(items)


def format_coreferences_display(coreferences: List[Dict]) -> str:
    """Format coreferences for human-readable display."""
    if not coreferences:
        return "No coreferences"
    items = []
    for coref in coreferences:
        cluster_id = coref["cluster_id"]
        entity_type = coref["entity_type"]
        mentions = ", ".join(coref["mentions"])
        items.append(f"Cluster {cluster_id} ({entity_type}): {mentions}")
    return " | ".join(items)


def convert_sample_to_labelstudio(
    sample_data: Dict[str, Any],
    sample_id: str,
    source: str = "llm_reviewed"
) -> Dict[str, Any]:
    """
    Convert a single sample to LabelStudio task format.

    Args:
        sample_data: The sample dictionary with text, privacy_mask, coreferences
        sample_id: Unique identifier for this sample
        source: Source of the data (e.g., 'llm_reviewed', 'llm_generated')

    Returns:
        LabelStudio task dictionary
    """
    text = sample_data["text"]
    privacy_mask = sample_data.get("privacy_mask", [])
    coreferences = sample_data.get("coreferences", [])

    # Convert entities to LabelStudio span annotations
    span_annotations = convert_privacy_mask_to_spans(text, privacy_mask)

    return {
        "id": sample_id,
        "data": {
            "text": text,
            "privacy_mask_str": format_privacy_mask_display(privacy_mask),
            "coreferences_str": format_coreferences_display(coreferences)
        },
        "predictions": [{
            "model_version": source,
            "created_at": utc_timestamp(),
            "result": span_annotations
        }],
        "annotations": [],
        "meta": {
            "created_at": utc_timestamp(),
            "source": source,
            "version": "1.0",
            "original_file": sample_id
        }
    }


def convert_dataset(
    input_dir: Path,
    output_dir: Path,
    limit: int | None = None,
    source: str = "llm_reviewed"
) -> int:
    """
    Convert all samples in the input directory to LabelStudio format.

    Args:
        input_dir: Directory containing JSON samples
        output_dir: Directory to write LabelStudio tasks
        limit: Maximum number of files to convert (None for all)
        source: Source identifier for metadata

    Returns:
        Number of files converted
    """
    output_dir.mkdir(parents=True, exist_ok=True)

    # Get all JSON files
    json_files = sorted(input_dir.glob("*.json"))
    if limit is not None:
        json_files = json_files[:limit]

    print(f"Converting {len(json_files)} files from {input_dir}...")

    converted_count = 0
    for json_file in json_files:
        try:
            # Read original sample
            with open(json_file, 'r', encoding='utf-8') as f:
                sample_data = json.load(f)

            # Use the filename (without .json extension) as the sample ID
            sample_id = json_file.stem

            # Convert to LabelStudio format
            ls_task = convert_sample_to_labelstudio(sample_data, sample_id, source)

            # Write to output
            output_file = output_dir / f"{sample_id}.json"
            with open(output_file, 'w', encoding='utf-8') as f:
                json.dump(ls_task, f, indent=2, ensure_ascii=False)

            converted_count += 1
            if converted_count % 100 == 0:
                print(f"Converted {converted_count}/{len(json_files)} files...")
        except Exception as e:
            print(f"Error converting {json_file}: {e}")
            continue

    print(f"✓ Successfully converted {converted_count} files to {output_dir}")
    return converted_count


def main() -> int:
    parser = argparse.ArgumentParser(
        description="Convert Yaak PII dataset to LabelStudio format"
    )
    parser.add_argument(
        "--input",
        type=Path,
        default=Path("model/dataset/reviewed_samples"),
        help="Input directory with JSON samples (default: model/dataset/reviewed_samples)"
    )
    parser.add_argument(
        "--output",
        type=Path,
        default=Path("model/dataset/labelstudio/import"),
        help="Output directory for LabelStudio tasks (default: model/dataset/labelstudio/import)"
    )
    parser.add_argument(
        "--limit",
        type=int,
        default=None,
        help="Limit number of files to convert (default: convert all)"
    )
    parser.add_argument(
        "--source",
        type=str,
        default="llm_reviewed",
        help="Source identifier for metadata (default: llm_reviewed)"
    )
    args = parser.parse_args()

    if not args.input.exists():
        print(f"Error: Input directory {args.input} does not exist")
        return 1

    converted_count = convert_dataset(
        args.input,
        args.output,
        args.limit,
        args.source
    )

    print(f"\n{'=' * 60}")
    print("Conversion complete!")
    print(f"Converted {converted_count} samples")
    print(f"Output directory: {args.output}")
    print("\nNext steps:")
    print("1. Start LabelStudio: docker-compose -f docker-compose.labelstudio.yml up -d")
    print("2. Access LabelStudio at http://localhost:8081")
    print(f"3. Create a new project and import tasks from {args.output}")
    print("4. Use the custom template from data/labelstudio/templates/pii_ner_coref_template.xml")
    print(f"{'=' * 60}")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
Make it executable:
chmod +x src/scripts/convert_to_labelstudio.py
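Before importing, it may be worth sanity-checking a few converted files: every predicted span's offsets should reproduce its text exactly. A small helper along these lines (the function names are illustrative and not part of the plan above):

```python
import json
from pathlib import Path
from typing import Dict, List

def check_task_spans(task: Dict) -> List[str]:
    """Return a description of every span whose offsets do not match its text."""
    text = task["data"]["text"]
    problems = []
    for prediction in task.get("predictions", []):
        for result in prediction.get("result", []):
            value = result["value"]
            actual = text[value["start"]:value["end"]]
            if actual != value["text"]:
                problems.append(
                    f"span {value['start']}-{value['end']}: expected "
                    f"{value['text']!r}, found {actual!r}"
                )
    return problems

def check_directory(import_dir: Path) -> None:
    """Print mismatches for every converted task file in a directory."""
    for path in sorted(import_dir.glob("*.json")):
        task = json.loads(path.read_text(encoding="utf-8"))
        for problem in check_task_spans(task):
            print(f"{path.name}: {problem}")
```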
Phase 6: Update Makefile
File: Makefile
Add new targets for LabelStudio workflow:
# LabelStudio targets
.PHONY: labelstudio-up labelstudio-down labelstudio-convert labelstudio-convert-sample

labelstudio-up: ## Start LabelStudio with Docker
	@echo "Starting LabelStudio..."
	docker-compose -f docker-compose.labelstudio.yml up -d
	@echo "LabelStudio is running at http://localhost:8081"
	@echo "Default login: admin / changeme123"

labelstudio-down: ## Stop LabelStudio
	@echo "Stopping LabelStudio..."
	docker-compose -f docker-compose.labelstudio.yml down

labelstudio-convert: ## Convert reviewed samples to LabelStudio format (full dataset)
	@echo "Converting dataset to LabelStudio format..."
	python3 src/scripts/convert_to_labelstudio.py \
		--input model/dataset/reviewed_samples \
		--output model/dataset/labelstudio/import \
		--source llm_reviewed

labelstudio-convert-sample: ## Convert 100 samples to LabelStudio format (for testing)
	@echo "Converting sample dataset to LabelStudio format..."
	python3 src/scripts/convert_to_labelstudio.py \
		--input model/dataset/reviewed_samples \
		--output model/dataset/labelstudio/import \
		--limit 100 \
		--source llm_reviewed
Testing & Usage Workflow
1. Initial Setup
# Create necessary directories
mkdir -p data/labelstudio/{projects,templates,config}
mkdir -p data/labelstudio_exports
mkdir -p model/dataset/labelstudio/{import,export,verified}
# Verify the annotation template from Phase 2 is in place
ls data/labelstudio/templates/pii_ner_coref_template.xml
# Convert a sample of the dataset
make labelstudio-convert-sample
# Start LabelStudio
make labelstudio-up
2. Configure LabelStudio Project
- Open http://localhost:8081
- Login with admin / changeme123
- Create new project: "Yaak PII Dataset Review"
- In Labeling Setup, paste the template from data/labelstudio/templates/pii_ner_coref_template.xml
- In Data Import, choose "Import from JSON" and upload files from model/dataset/labelstudio/import/
3. Annotation Workflow
- Review pre-annotations: LabelStudio will show predicted entities in the interface
- Correct errors: Fix any mislabeled entities or missed entities
- Add co-references: Connect related mentions using relations
- Add quality rating: Rate the sample quality
- Add notes: Document any issues or observations
- Submit: Save the annotation
4. Export Reviewed Data
# From LabelStudio UI:
# 1. Go to Export
# 2. Choose JSON format
# 3. Download to data/labelstudio_exports/
# Convert back to training format (future script)
python src/scripts/convert_from_labelstudio.py \
--input data/labelstudio_exports/project-1-export.json \
--output model/dataset/labelstudio/verified
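The reverse converter does not exist yet; as a rough sketch of what it might do, assuming the LabelStudio JSON export where each task carries an annotations list of results (only NER spans handled here, and the function name is hypothetical):

```python
from typing import Any, Dict

def labelstudio_task_to_sample(task: Dict[str, Any]) -> Dict[str, Any]:
    """Rebuild the internal {"text", "privacy_mask"} shape from an exported task.

    Co-reference relations would need a separate pass over results of
    type "relation"; they are skipped here.
    """
    text = task["data"]["text"]
    privacy_mask = []
    # Human annotations take precedence over model predictions in exports.
    for annotation in task.get("annotations", []):
        for result in annotation.get("result", []):
            if result.get("type") != "labels":
                continue
            value = result["value"]
            for label in value.get("labels", []):
                privacy_mask.append({"value": value["text"], "label": label})
    return {"text": text, "privacy_mask": privacy_mask}
```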
Success Criteria
Future Enhancements
- Reverse conversion script: Convert LabelStudio exports back to training format
- Inter-annotator agreement: Calculate agreement metrics between annotators
- Active learning: Prioritize samples for review based on model uncertainty
- Batch processing: API integration for programmatic task management
- Quality metrics: Track annotation quality over time
References
Notes
This is marked as a "good first issue" because:
- Clear, step-by-step implementation plan
- Well-defined scope with example code
- Uses standard tools (Docker, Python)
- Touches multiple aspects: DevOps, data processing, annotation workflow
- Good learning opportunity for data annotation pipelines