
Silent zero-loss training when vocab_size/class_token_index are hardcoded inconsistently (mmBERT token-level) #332

@arthrod

Description


Summary

Token-level training can silently run with loss=0 (and grad_norm=0) when vocab_size and class_token_index are hardcoded to values that assume GLiNER special tokens already exist in the tokenizer vocab.

In our case, this happened with jhu-clsp/mmBERT-base on GLiNER 0.2.25.

Environment

  • GLiNER: 0.2.25
  • transformers: 5.1.0
  • torch: 2.10.0+rocm7.1
  • accelerate: 1.12.0
  • Python: 3.13.12
  • Hardware: AMD Instinct MI300X (single GPU, ROCm)

What happened

Training started normally, but logs remained flat at zero:

{'loss': '0', 'grad_norm': '0', 'learning_rate': '5.656e-06', 'epoch': '0.05678'}
{'loss': '0', 'grad_norm': '0', 'learning_rate': '5.866e-06', 'epoch': '0.05889'}
...

The dataset itself was healthy (non-empty labels):

  • 367,511 training samples (all with spans)
  • 1,949,207 total spans
  • per-batch positive label entries were in the thousands after fixing the config

Root cause

We had this in the config:

model_name: jhu-clsp/mmBERT-base
vocab_size: 256003
class_token_index: 256001

For this tokenizer, the base vocabulary size is 256000.

The GLiNER special tokens (<<ENT>>, <<SEP>>, [FLERT]) were not actually present in the tokenizer state, but the hardcoded indices were treated as if they pointed at real tokens. The result was an effectively empty entity-prompt setup and no learning signal (loss=0, grad_norm=0), with no hard failure.
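A pre-flight consistency check along these lines would have caught the mismatch before training started. This is a minimal sketch (check_token_config is a hypothetical helper, not part of GLiNER's API), using the numbers from this issue:

```python
# Hypothetical pre-flight check (illustration only, not GLiNER API):
# compare hardcoded indices against the tokenizer's actual size.
def check_token_config(tokenizer_len: int, vocab_size: int,
                       class_token_index: int) -> None:
    if class_token_index >= 0 and class_token_index >= tokenizer_len:
        raise ValueError(
            f"class_token_index={class_token_index} is out of range for a "
            f"tokenizer with {tokenizer_len} tokens; set it to -1 so GLiNER "
            "adds its special tokens and detects the index itself."
        )
    if vocab_size >= 0 and vocab_size != tokenizer_len:
        raise ValueError(
            f"vocab_size={vocab_size} does not match the tokenizer "
            f"({tokenizer_len} tokens)."
        )

# Values from this issue: mmBERT-base has 256000 base tokens, but the
# config assumed GLiNER's special tokens were already added on top.
try:
    check_token_config(tokenizer_len=256000, vocab_size=256003,
                       class_token_index=256001)
except ValueError as e:
    print("caught:", e)
```

With the broken config this raises ValueError immediately, instead of letting training run for hours at loss=0.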

Workaround (confirmed)

Set both values to -1 so that GLiNER auto-adds the special tokens and auto-detects their indices:

vocab_size: -1
class_token_index: -1

After this change, the initial loss probe returned a healthy non-zero value (roughly 846012892) and training proceeded normally.

Expected behavior

If vocab_size / class_token_index are set explicitly but the tokenizer is inconsistent with them, GLiNER should fail fast with a clear error instead of silently training with zero loss.

Suggested fixes

  1. Validate tokenizer/config consistency before training:
    • if class_token_index >= len(tokenizer) -> raise ValueError
    • if expected special tokens are missing but explicit indices are provided -> raise with actionable guidance
  2. Add an early training guard:
    • if collated labels have zero class dimension / all-zero positives for several warmup batches -> raise immediately
  3. Document strongly in token-level config docs:
    • prefer vocab_size: -1 and class_token_index: -1 unless you are certain the indices map to real tokenizer entries.
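The early training guard in suggestion 2 could look roughly like this. ZeroLossGuard is a hypothetical name and the real fix would live inside GLiNER's training loop; the sketch only shows the shape of the check:

```python
# Hypothetical guard (illustration of suggestion 2, not an existing
# GLiNER feature): abort early if the warmup batches carry no positive
# label entries, since that means there is no learning signal at all.
class ZeroLossGuard:
    def __init__(self, warmup_batches: int = 10):
        self.warmup_batches = warmup_batches
        self.seen = 0
        self.positives = 0

    def update(self, num_positive_labels: int) -> None:
        """Call once per collated batch with its count of positive labels."""
        self.seen += 1
        self.positives += num_positive_labels
        if self.seen >= self.warmup_batches and self.positives == 0:
            raise RuntimeError(
                f"No positive labels in the first {self.seen} batches; "
                "check vocab_size / class_token_index in your config "
                "(setting both to -1 lets GLiNER manage special tokens)."
            )
```

A run with the misconfiguration from this issue would be stopped after a handful of batches instead of burning a full GPU run at loss=0.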

Why this matters

This failure mode looks like a normal run (steps advancing, LR changing) while training does nothing. It is expensive and easy to miss on long runs.
