Setup LabelStudio for Dataset Review #31

@hanneshapke

Description

Background

The project currently maintains three separate dataset directories with different stages of data processing:

Dataset Directory Structure

  • model/dataset/samples/ (~20,000 files): Raw samples generated by LLM
  • model/dataset/reviewed_samples/ (~20,000 files): Samples reviewed and corrected by LLM
  • model/dataset/training_samples/ (~6,000 files): Final processed samples used for model training

Current Data Format

Each JSON file contains:

{
  "text": "Fatima Khaled resides at 2114 Cedar Crescent in Marseille...",
  "privacy_mask": [
    {"value": "Fatima", "label": "FIRSTNAME"},
    {"value": "Khaled", "label": "SURNAME"},
    {"value": "2114", "label": "BUILDINGNUM"}
  ],
  "coreferences": [
    {
      "cluster_id": 0,
      "mentions": ["Fatima Khaled", "Her", "she"],
      "entity_type": "person"
    }
  ]
}

Key characteristics:

  • Labels are direct (e.g., FIRSTNAME, SURNAME) without BIO prefixes (no B-/I- notation)
  • Co-references track entity mentions across the text
  • PII labels include: SURNAME, FIRSTNAME, EMAIL, PHONENUMBER, ADDRESS, CITY, SSN, DRIVER_LICENSE, etc.
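A quick way to sanity-check samples in this format: every `privacy_mask` value should occur verbatim in `text`, or downstream span extraction will silently drop it. A minimal check, using an inline copy of the example above (in practice, load any JSON file from the dataset directories):

```python
# Inline copy of the sample above; in practice, json.load() any dataset file
sample = {
    "text": "Fatima Khaled resides at 2114 Cedar Crescent in Marseille...",
    "privacy_mask": [
        {"value": "Fatima", "label": "FIRSTNAME"},
        {"value": "Khaled", "label": "SURNAME"},
        {"value": "2114", "label": "BUILDINGNUM"},
    ],
}

def missing_entities(sample: dict) -> list:
    """Return privacy_mask values that do not occur verbatim in the text."""
    return [e["value"] for e in sample["privacy_mask"]
            if e["value"] not in sample["text"]]

print(missing_entities(sample))  # an empty list means the sample is consistent
```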

Challenge

Currently, there's no systematic human review process for validating:

  1. NER extraction accuracy (are PII entities correctly identified?)
  2. Label correctness (is "Fatima" correctly labeled as FIRSTNAME?)
  3. Co-reference resolution quality (are mentions properly grouped?)

This issue proposes integrating LabelStudio for efficient human review and quality assurance of the dataset.


Implementation Plan

Phase 1: Setup LabelStudio with Docker

File: docker-compose.labelstudio.yml (new)

Create a dedicated Docker Compose file for LabelStudio:

services:
  labelstudio:
    image: heartexlabs/label-studio:latest
    container_name: yaak-labelstudio
    ports:
      - "8081:8080"
    volumes:
      # Data storage for LabelStudio projects
      - ./data/labelstudio:/label-studio/data
      # Mount dataset directories (read-only for safety)
      - ./model/dataset/samples:/datasets/samples:ro
      - ./model/dataset/reviewed_samples:/datasets/reviewed_samples:ro
      - ./model/dataset/training_samples:/datasets/training_samples:ro
      # Export directory for reviewed annotations
      - ./data/labelstudio_exports:/exports
    environment:
      - LABEL_STUDIO_LOCAL_FILES_SERVING_ENABLED=true
      - LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT=/datasets
      # For production, set strong credentials
      - LABEL_STUDIO_USERNAME=admin
      - LABEL_STUDIO_PASSWORD=changeme123
    restart: unless-stopped
    networks:
      - yaak-network

networks:
  yaak-network:
    driver: bridge


Usage:

# Start LabelStudio
docker-compose -f docker-compose.labelstudio.yml up -d

# Access at http://localhost:8081
# Login with admin/changeme123 (change in production!)

# Stop LabelStudio
docker-compose -f docker-compose.labelstudio.yml down
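Readiness can also be probed programmatically before scripting against the instance. A small sketch, assuming the port mapping above and LabelStudio's `/health` endpoint (verify the endpoint against your LabelStudio version):

```python
import urllib.request

def labelstudio_ready(base_url: str = "http://localhost:8081") -> bool:
    """Return True if the LabelStudio instance answers its health check."""
    try:
        # LabelStudio serves a health endpoint on its web port; 200 means ready
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False
```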

Phase 2: Create Custom LabelStudio Template

File: data/labelstudio/templates/pii_ner_coref_template.xml (new)

Create a custom annotation template that supports:

  1. NER annotation without B-/I- prefixes
  2. Co-reference review for entity mentions

<View>
  <Header value="PII Entity and Co-reference Annotation"/>
  
  <Text name="text" value="$text"/>
  
  <!-- NER Annotation Section -->
  <View style="box-shadow: 2px 2px 5px #999; padding: 20px; margin-top: 2em; border-radius: 5px;">
    <Header value="Named Entity Recognition (PII)" size="4"/>
    <Labels name="label" toName="text">
      <!-- Person Information -->
      <Label value="FIRSTNAME" background="#FFA39E"/>
      <Label value="SURNAME" background="#FF7875"/>
      
      <!-- Contact Information -->
      <Label value="EMAIL" background="#FFD666"/>
      <Label value="PHONENUMBER" background="#FFC53D"/>
      <Label value="URL" background="#FFB340"/>
      
      <!-- Location Information -->
      <Label value="BUILDINGNUM" background="#95DE64"/>
      <Label value="STREET" background="#73D13D"/>
      <Label value="CITY" background="#52C41A"/>
      <Label value="STATE" background="#389E0D"/>
      <Label value="ZIP" background="#237804"/>
      <Label value="COUNTRY" background="#135200"/>
      
      <!-- Identification Numbers -->
      <Label value="SSN" background="#FF85C0"/>
      <Label value="DRIVER_LICENSE" background="#F759AB"/>
      <Label value="PASSPORT" background="#EB2F96"/>
      <Label value="NATIONAL_ID" background="#C41D7F"/>
      <Label value="IDCARDNUM" background="#9E1068"/>
      <Label value="TAXNUM" background="#780650"/>
      <Label value="LICENSEPLATENUM" background="#520339"/>
      
      <!-- Financial Information -->
      <Label value="IBAN" background="#91CAFF"/>
      
      <!-- Company Information -->
      <Label value="COMPANYNAME" background="#BAE0FF"/>
      
      <!-- Other -->
      <Label value="DATEOFBIRTH" background="#D3ADF7"/>
      <Label value="DOB" background="#B37FEB"/>
      <Label value="AGE" background="#9254DE"/>
      <Label value="PASSWORD" background="#FF4D4F"/>
    </Labels>
  </View>
  
  <!-- Co-reference Annotation Section -->
  <View style="box-shadow: 2px 2px 5px #999; padding: 20px; margin-top: 2em; border-radius: 5px;">
    <Header value="Co-reference Resolution" size="4"/>
    <Relations name="coreference" toName="text">
      <Relation value="COREF_PERSON" background="#69c0ff"/>
      <Relation value="COREF_ORGANIZATION" background="#95de64"/>
      <Relation value="COREF_LOCATION" background="#ffc069"/>
      <Relation value="COREF_OTHER" background="#d3adf7"/>
    </Relations>
  </View>
  
  <!-- Pre-annotated Data Display -->
  <View style="box-shadow: 2px 2px 5px #999; padding: 20px; margin-top: 2em; border-radius: 5px; background: #f5f5f5;">
    <Header value="Pre-annotated Data (Reference)" size="4"/>
    <Text name="privacy_mask_display" value="$privacy_mask_str"/>
    <Text name="coreference_display" value="$coreferences_str"/>
  </View>
  
  <!-- Review Notes -->
  <View style="margin-top: 2em;">
    <Header value="Review Notes" size="4"/>
    <TextArea name="notes" toName="text" 
              placeholder="Add any notes about annotation quality, issues, or corrections..."
              rows="3"/>
    
    <Choices name="quality" toName="text" choice="single" showInline="true">
      <Choice value="Excellent"/>
      <Choice value="Good"/>
      <Choice value="Fair"/>
      <Choice value="Poor"/>
    </Choices>
  </View>
</View>

Template Features:

  • ✅ Direct label annotation (no B-/I- prefixes needed)
  • ✅ Color-coded by category (Person, Contact, Location, ID, Financial)
  • ✅ Co-reference relations support
  • ✅ Display pre-annotated data for reference
  • ✅ Quality rating and notes for each sample

Phase 3: Recommended Storage Structure

Create a new unified storage structure that works for both LabelStudio and model training:

model/
├── dataset/
│   ├── raw/                    # Raw generated samples (existing: samples/)
│   ├── reviewed/              # LLM-reviewed samples (existing: reviewed_samples/)
│   ├── training/              # Final training data (existing: training_samples/)
│   └── labelstudio/           # NEW: LabelStudio-specific data
│       ├── import/            # Converted data ready for import
│       ├── export/            # Human-reviewed exports from LabelStudio
│       └── verified/          # Final verified dataset
│
data/                          # NEW: Root-level data directory
├── labelstudio/
│   ├── projects/              # LabelStudio project files
│   ├── templates/             # Annotation templates
│   │   └── pii_ner_coref_template.xml
│   └── config/                # LabelStudio configuration
└── labelstudio_exports/       # Export destination

Rationale:

  • Separation of concerns: Keep LabelStudio data separate from model data
  • Docker-friendly: Easy to mount specific directories
  • Version control: Can gitignore large data files while keeping configs
  • Workflow clarity: Clear path from raw → reviewed → training → verified

Phase 4: Dataset Structure Changes

Proposed changes for LabelStudio compatibility:

  1. Add task IDs: Each sample needs a unique ID for LabelStudio tracking
  2. Flatten structure: Convert nested JSON to LabelStudio's expected format
  3. Pre-annotation support: Include existing labels as predictions
  4. Metadata: Add creation timestamp, source, version

New format:

{
  "id": "20251124103832_fb0dd1a3",
  "data": {
    "text": "Fatima Khaled resides at 2114 Cedar Crescent...",
    "privacy_mask_str": "FIRSTNAME: Fatima | SURNAME: Khaled | BUILDINGNUM: 2114",
    "coreferences_str": "Cluster 0 (person): Fatima Khaled, Her, she"
  },
  "predictions": [{
    "model_version": "llm_generated_v1",
    "result": [
      {
        "value": {
          "start": 0,
          "end": 6,
          "text": "Fatima",
          "labels": ["FIRSTNAME"]
        },
        "from_name": "label",
        "to_name": "text",
        "type": "labels"
      }
    ]
  }],
  "annotations": [],
  "meta": {
    "created_at": "2024-11-24T10:38:32Z",
    "source": "llm_generation",
    "version": "1.0"
  }
}
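The invariant worth checking before import is that each prediction's `start`/`end` offsets slice `text` back to exactly the annotated string; otherwise LabelStudio renders misaligned pre-annotations. A small check against a task shaped like the example above:

```python
# Task trimmed to the fields relevant for offset validation
task = {
    "data": {"text": "Fatima Khaled resides at 2114 Cedar Crescent..."},
    "predictions": [{
        "model_version": "llm_generated_v1",
        "result": [{
            "value": {"start": 0, "end": 6, "text": "Fatima",
                      "labels": ["FIRSTNAME"]},
            "from_name": "label", "to_name": "text", "type": "labels",
        }],
    }],
}

text = task["data"]["text"]
for prediction in task["predictions"]:
    for result in prediction["result"]:
        value = result["value"]
        # The offsets must reproduce the annotated surface string
        assert text[value["start"]:value["end"]] == value["text"]
```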

Phase 5: Conversion Script

File: src/scripts/convert_to_labelstudio.py (new)

Create a conversion script to transform existing dataset format to LabelStudio format:

#!/usr/bin/env python3
"""
Convert Yaak PII dataset to LabelStudio import format.

Usage:
    python src/scripts/convert_to_labelstudio.py \
        --input model/dataset/reviewed_samples \
        --output model/dataset/labelstudio/import \
        --limit 100

This script converts the internal JSON format to LabelStudio's task format,
including character-level span annotations for NER and co-reference data.
"""

import argparse
import json
import re
from pathlib import Path
from typing import Any, Dict, List
from datetime import datetime


def find_entity_spans(text: str, entity_value: str) -> List[tuple]:
    """
    Find all occurrences of entity_value in text and return (start, end) positions.
    
    Args:
        text: The full text to search in
        entity_value: The entity string to find
        
    Returns:
        List of (start, end) tuples for each occurrence
    """
    spans = []
    # Escape regex metacharacters so the entity value is matched literally
    pattern = re.escape(entity_value)
    for match in re.finditer(pattern, text):
        spans.append((match.start(), match.end()))
    return spans


def convert_privacy_mask_to_spans(text: str, privacy_mask: List[Dict]) -> List[Dict]:
    """
    Convert privacy_mask entities to LabelStudio span annotations.
    
    Args:
        text: The full text
        privacy_mask: List of {"value": str, "label": str}
        
    Returns:
        List of LabelStudio annotation results
    """
    results = []
    seen_positions = set()  # Track already annotated positions to avoid duplicates
    
    for entity in privacy_mask:
        value = entity["value"]
        label = entity["label"]
        
        # Find all occurrences of this entity in the text
        spans = find_entity_spans(text, value)
        
        for start, end in spans:
            # Skip if this position already annotated (handles duplicates)
            position_key = (start, end)
            if position_key in seen_positions:
                continue
            seen_positions.add(position_key)
            
            results.append({
                "value": {
                    "start": start,
                    "end": end,
                    "text": value,
                    "labels": [label]
                },
                "from_name": "label",
                "to_name": "text",
                "type": "labels"
            })
    
    return results


def format_privacy_mask_display(privacy_mask: List[Dict]) -> str:
    """Format privacy_mask for human-readable display."""
    items = [f"{item['label']}: {item['value']}" for item in privacy_mask]
    return " | ".join(items)


def format_coreferences_display(coreferences: List[Dict]) -> str:
    """Format coreferences for human-readable display."""
    if not coreferences:
        return "No coreferences"
    
    items = []
    for coref in coreferences:
        cluster_id = coref["cluster_id"]
        entity_type = coref["entity_type"]
        mentions = ", ".join(coref["mentions"])
        items.append(f"Cluster {cluster_id} ({entity_type}): {mentions}")
    return " | ".join(items)


def convert_sample_to_labelstudio(
    sample_data: Dict[str, Any],
    sample_id: str,
    source: str = "llm_reviewed"
) -> Dict[str, Any]:
    """
    Convert a single sample to LabelStudio task format.
    
    Args:
        sample_data: The sample dictionary with text, privacy_mask, coreferences
        sample_id: Unique identifier for this sample
        source: Source of the data (e.g., 'llm_reviewed', 'llm_generated')
        
    Returns:
        LabelStudio task dictionary
    """
    text = sample_data["text"]
    privacy_mask = sample_data.get("privacy_mask", [])
    coreferences = sample_data.get("coreferences", [])
    
    # Convert entities to LabelStudio span annotations
    span_annotations = convert_privacy_mask_to_spans(text, privacy_mask)
    
    # Create LabelStudio task
    task = {
        "id": sample_id,
        "data": {
            "text": text,
            "privacy_mask_str": format_privacy_mask_display(privacy_mask),
            "coreferences_str": format_coreferences_display(coreferences)
        },
        "predictions": [{
            "model_version": source,
            "created_at": datetime.utcnow().isoformat() + "Z",
            "result": span_annotations
        }],
        "annotations": [],
        "meta": {
            "created_at": datetime.utcnow().isoformat() + "Z",
            "source": source,
            "version": "1.0",
            "original_file": sample_id
        }
    }
    
    return task


def convert_dataset(
    input_dir: Path,
    output_dir: Path,
    limit: int | None = None,
    source: str = "llm_reviewed"
) -> int:
    """
    Convert all samples in input directory to LabelStudio format.
    
    Args:
        input_dir: Directory containing JSON samples
        output_dir: Directory to write LabelStudio tasks
        limit: Maximum number of files to convert (None for all)
        source: Source identifier for metadata
        
    Returns:
        Number of files converted
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # Get all JSON files
    json_files = sorted(input_dir.glob("*.json"))
    
    if limit is not None:
        json_files = json_files[:limit]
    
    print(f"Converting {len(json_files)} files from {input_dir}...")
    
    converted_count = 0
    for json_file in json_files:
        try:
            # Read original sample
            with open(json_file, 'r', encoding='utf-8') as f:
                sample_data = json.load(f)
            
            # Extract sample ID from filename (remove .json extension)
            sample_id = json_file.stem
            
            # Convert to LabelStudio format
            ls_task = convert_sample_to_labelstudio(sample_data, sample_id, source)
            
            # Write to output
            output_file = output_dir / f"{sample_id}.json"
            with open(output_file, 'w', encoding='utf-8') as f:
                json.dump(ls_task, f, indent=2, ensure_ascii=False)
            
            converted_count += 1
            
            if converted_count % 100 == 0:
                print(f"Converted {converted_count}/{len(json_files)} files...")
                
        except Exception as e:
            print(f"Error converting {json_file}: {e}")
            continue
    
    print(f"✓ Successfully converted {converted_count} files to {output_dir}")
    return converted_count


def main():
    parser = argparse.ArgumentParser(
        description="Convert Yaak PII dataset to LabelStudio format"
    )
    parser.add_argument(
        "--input",
        type=Path,
        default=Path("model/dataset/reviewed_samples"),
        help="Input directory with JSON samples (default: model/dataset/reviewed_samples)"
    )
    parser.add_argument(
        "--output",
        type=Path,
        default=Path("model/dataset/labelstudio/import"),
        help="Output directory for LabelStudio tasks (default: model/dataset/labelstudio/import)"
    )
    parser.add_argument(
        "--limit",
        type=int,
        default=None,
        help="Limit number of files to convert (default: convert all)"
    )
    parser.add_argument(
        "--source",
        type=str,
        default="llm_reviewed",
        help="Source identifier for metadata (default: llm_reviewed)"
    )
    
    args = parser.parse_args()
    
    if not args.input.exists():
        print(f"Error: Input directory {args.input} does not exist")
        return 1
    
    converted_count = convert_dataset(
        args.input,
        args.output,
        args.limit,
        args.source
    )
    
    print(f"\n{'='*60}")
    print("Conversion complete!")
    print(f"Converted {converted_count} samples")
    print(f"Output directory: {args.output}")
    print("\nNext steps:")
    print("1. Start LabelStudio: docker-compose -f docker-compose.labelstudio.yml up -d")
    print("2. Access LabelStudio at http://localhost:8081")
    print(f"3. Create a new project and import tasks from {args.output}")
    print("4. Use the custom template from data/labelstudio/templates/pii_ner_coref_template.xml")
    print(f"{'='*60}")
    
    return 0


if __name__ == "__main__":
    exit(main())

Make it executable:

chmod +x src/scripts/convert_to_labelstudio.py
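Before converting the full dataset, the span logic can be spot-checked in isolation (the function below is copied from the script so the snippet stands alone). One caveat the script inherits: `re.escape` matches plain substrings, so e.g. `2114` would also match inside a longer number; samples with such overlaps deserve manual review.

```python
import re

# Standalone copy of find_entity_spans from the conversion script
def find_entity_spans(text: str, entity_value: str):
    pattern = re.escape(entity_value)
    return [(m.start(), m.end()) for m in re.finditer(pattern, text)]

text = "Fatima Khaled resides at 2114 Cedar Crescent. Fatima works nearby."
spans = find_entity_spans(text, "Fatima")

# Both occurrences are found, and each span slices back to the entity
assert len(spans) == 2
assert all(text[start:end] == "Fatima" for start, end in spans)
```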

Phase 6: Update Makefile

File: Makefile

Add new targets for LabelStudio workflow:

# LabelStudio targets
.PHONY: labelstudio-up labelstudio-down labelstudio-convert labelstudio-convert-sample

labelstudio-up:  ## Start LabelStudio with Docker
\t@echo "Starting LabelStudio..."
\tdocker-compose -f docker-compose.labelstudio.yml up -d
\t@echo "LabelStudio is running at http://localhost:8081"
\t@echo "Default login: admin / changeme123"

labelstudio-down:  ## Stop LabelStudio
\t@echo "Stopping LabelStudio..."
\tdocker-compose -f docker-compose.labelstudio.yml down

labelstudio-convert:  ## Convert reviewed samples to LabelStudio format (full dataset)
\t@echo "Converting dataset to LabelStudio format..."
\tpython3 src/scripts/convert_to_labelstudio.py \
\t\t--input model/dataset/reviewed_samples \
\t\t--output model/dataset/labelstudio/import \
\t\t--source llm_reviewed

labelstudio-convert-sample:  ## Convert 100 samples to LabelStudio format (for testing)
\t@echo "Converting sample dataset to LabelStudio format..."
\tpython3 src/scripts/convert_to_labelstudio.py \
\t\t--input model/dataset/reviewed_samples \
\t\t--output model/dataset/labelstudio/import \
\t\t--limit 100 \
\t\t--source llm_reviewed

Testing & Usage Workflow

1. Initial Setup

# Create necessary directories
mkdir -p data/labelstudio/{projects,templates,config}
mkdir -p data/labelstudio_exports
mkdir -p model/dataset/labelstudio/{import,export,verified}

# The annotation template from Phase 2 should already be in place at
# data/labelstudio/templates/pii_ner_coref_template.xml

# Convert a sample of the dataset
make labelstudio-convert-sample

# Start LabelStudio
make labelstudio-up

2. Configure LabelStudio Project

  1. Open http://localhost:8081
  2. Login with admin / changeme123
  3. Create new project: "Yaak PII Dataset Review"
  4. In Labeling Setup, paste the template from data/labelstudio/templates/pii_ner_coref_template.xml
  5. In Data Import, choose "Import from JSON" and upload files from model/dataset/labelstudio/import/

3. Annotation Workflow

  1. Review pre-annotations: LabelStudio will show predicted entities in the interface
  2. Correct errors: Fix any mislabeled entities or missed entities
  3. Add co-references: Connect related mentions using relations
  4. Add quality rating: Rate the sample quality
  5. Add notes: Document any issues or observations
  6. Submit: Save the annotation

4. Export Reviewed Data

# From LabelStudio UI:
# 1. Go to Export
# 2. Choose JSON format
# 3. Download to data/labelstudio_exports/

# Convert back to training format (future script)
python src/scripts/convert_from_labelstudio.py \
    --input data/labelstudio_exports/project-1-export.json \
    --output model/dataset/labelstudio/verified
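The future reverse-conversion script could collapse each exported task's first accepted annotation back into the internal `privacy_mask` shape. A hypothetical sketch (`export_to_privacy_mask` is illustrative; the field access follows LabelStudio's standard JSON export schema and should be verified against a real export before use):

```python
def export_to_privacy_mask(task: dict) -> dict:
    """Collapse one exported LabelStudio task back to the internal format.

    Hypothetical helper for the future convert_from_labelstudio.py;
    only span ("labels") results are kept, relations and notes are ignored.
    """
    annotations = task.get("annotations", [])
    results = annotations[0].get("result", []) if annotations else []
    privacy_mask = [
        {"value": r["value"]["text"], "label": r["value"]["labels"][0]}
        for r in results
        if r.get("type") == "labels"
    ]
    return {"text": task["data"]["text"], "privacy_mask": privacy_mask}
```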

Success Criteria

  • Docker Compose file created for LabelStudio
  • LabelStudio starts successfully on port 8081
  • Custom annotation template created with:
    • All PII labels (without B-/I- prefixes)
    • Co-reference relation support
    • Pre-annotation display
    • Quality rating and notes
  • Directory structure created for LabelStudio data
  • Conversion script implemented (src/scripts/convert_to_labelstudio.py)
  • Conversion script handles:
    • Entity span calculation
    • Pre-annotation generation
    • Metadata preservation
  • Makefile targets added for easy workflow
  • Successfully import 100 samples into LabelStudio
  • Documentation updated with workflow instructions

Future Enhancements

  1. Reverse conversion script: Convert LabelStudio exports back to training format
  2. Inter-annotator agreement: Calculate agreement metrics between annotators
  3. Active learning: Prioritize samples for review based on model uncertainty
  4. Batch processing: API integration for programmatic task management
  5. Quality metrics: Track annotation quality over time


Notes

This is marked as a "good first issue" because:

  • Clear, step-by-step implementation plan
  • Well-defined scope with example code
  • Uses standard tools (Docker, Python)
  • Touches multiple aspects: DevOps, data processing, annotation workflow
  • Good learning opportunity for data annotation pipelines
