Summary
Token-level training can silently run with loss=0 (and grad_norm=0) when vocab_size and class_token_index are hardcoded to values that assume GLiNER special tokens already exist in the tokenizer vocab.
In our case, this happened with jhu-clsp/mmBERT-base on GLiNER 0.2.25.
Environment
- GLiNER: 0.2.25
- transformers: 5.1.0
- torch: 2.10.0+rocm7.1
- accelerate: 1.12.0
- Python: 3.13.12
- Hardware: AMD Instinct MI300X (single GPU, ROCm)
What happened
Training started normally, but logs remained flat at zero:
```
{'loss': '0', 'grad_norm': '0', 'learning_rate': '5.656e-06', 'epoch': '0.05678'}
{'loss': '0', 'grad_norm': '0', 'learning_rate': '5.866e-06', 'epoch': '0.05889'}
...
```
The dataset itself was healthy (non-empty labels):
- 367,511 training samples (all with spans)
- 1,949,207 total spans
- per-batch positive label entries were in the thousands after fixing the config
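To rule out the data, we verified span coverage directly. A minimal version of that check looks like the sketch below; the `ner` and `tokenized_text` field names are assumptions about the token-level dataset format, not guaranteed GLiNER schema:

```python
# Sanity-check that training samples actually carry spans before suspecting the data.
# Field names ("ner", "tokenized_text") are assumed, adjust to your dataset schema.
def span_stats(samples):
    with_spans = sum(1 for s in samples if s.get("ner"))
    total_spans = sum(len(s.get("ner", [])) for s in samples)
    return with_spans, total_spans

data = [
    {"tokenized_text": ["Acme", "Corp", "hired", "Alice"],
     "ner": [[0, 1, "org"], [3, 3, "person"]]},
    {"tokenized_text": ["nothing", "here"], "ner": []},
]
print(span_stats(data))  # (1, 2): one sample with spans, two spans total
```

In our run this confirmed the corpus was healthy, which pointed the investigation at the token/config side instead.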
Root cause
We had this in config:

```yaml
model_name: jhu-clsp/mmBERT-base
vocab_size: 256003
class_token_index: 256001
```

For this tokenizer, the base vocab size is 256000. The GLiNER special tokens (<<ENT>>, <<SEP>>, [FLERT]) were not actually present in the tokenizer state, but the hardcoded indices were treated as if they were valid. That produced an effectively empty entity-prompt setup and no learning signal (loss=0, grad_norm=0), with no hard failure.
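A fail-fast pre-flight check of the kind we would have wanted can be sketched as below. This is a hypothetical helper, not GLiNER API; it takes the tokenizer size and token-to-id mapping as plain values so the logic is easy to test:

```python
# Hypothetical pre-flight check: fail fast when explicit indices do not match
# the tokenizer state, instead of training silently with loss=0.
def validate_token_config(tokenizer_size, token_to_id, unk_id,
                          class_token_index,
                          special_tokens=("<<ENT>>", "<<SEP>>")):
    if class_token_index >= tokenizer_size:
        raise ValueError(
            f"class_token_index={class_token_index} is outside the tokenizer "
            f"vocab (size {tokenizer_size}); set it to -1 to auto-detect.")
    # Tokens absent from the vocab resolve to the unk id, so treat that as missing.
    missing = [t for t in special_tokens if token_to_id.get(t, unk_id) == unk_id]
    if missing:
        raise ValueError(
            f"Special tokens {missing} are not in the tokenizer; set "
            "vocab_size/class_token_index to -1 so they are added automatically.")

# Our broken setup: base vocab 256000, no GLiNER tokens added, index 256001.
try:
    validate_token_config(256000, {}, 0, class_token_index=256001)
except ValueError as e:
    print(e)  # prints the out-of-range error
```

With the real tokenizer you would pass `len(tokenizer)`, `tokenizer.get_vocab()`, and `tokenizer.unk_token_id`.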
Workaround (confirmed)
Set both values to -1 so GLiNER auto-adds the tokens and auto-detects the indices:

```yaml
vocab_size: -1
class_token_index: -1
```

After this change, the initial loss probe became healthy (roughly 8460–12892) and training proceeded normally.
Expected behavior
If vocab_size / class_token_index are set explicitly but the tokenizer state is inconsistent with them, GLiNER should fail fast with a clear error instead of silently training at zero loss.
Suggested fixes
- Validate tokenizer/config consistency before training:
  - if class_token_index >= len(tokenizer) -> raise ValueError
  - if expected special tokens are missing but explicit indices are provided -> raise with actionable guidance
- Add an early training guard:
  - if collated labels have a zero class dimension / all-zero positives for several warmup batches -> raise immediately
- Document strongly in the token-level config docs:
  - prefer vocab_size: -1 and class_token_index: -1 unless you are certain the indices map to real tokenizer entries.
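The early training guard could be as simple as the sketch below. This is an illustration of the proposed behavior, not existing GLiNER code; wired into the training loop (or a logging callback), it would have aborted our run within the warmup window instead of hours later:

```python
# Sketch of the proposed guard: abort if the loss is exactly zero for a whole
# warmup window, which almost always means a config/tokenizer problem rather
# than a converged model.
class ZeroLossGuard:
    def __init__(self, warmup_steps=20):
        self.warmup_steps = warmup_steps
        self.zero_streak = 0

    def update(self, loss):
        self.zero_streak = self.zero_streak + 1 if loss == 0 else 0
        if self.zero_streak >= self.warmup_steps:
            raise RuntimeError(
                f"loss has been exactly 0 for {self.zero_streak} consecutive "
                "steps; check vocab_size/class_token_index and special tokens.")

guard = ZeroLossGuard(warmup_steps=3)
for step_loss in [0.0, 0.0, 9500.0, 0.0]:  # a healthy step resets the streak
    guard.update(step_loss)
```

A streak counter (rather than checking a single step) avoids false alarms on the rare legitimately-zero batch while still catching the pathological all-zero run quickly.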
Why this matters
This failure mode looks like a normal run (steps advancing, LR changing) while training does nothing. It is expensive and easy to miss on long runs.