BOS tokens not properly added in some circumstances #345

When using `DocumentTokenizer`, if an `eos_token` is specified, the tokenizer's post-processor is replaced with one that appends an EOS token. However, the replacement discards the tokenizer's original post-processor, so a BOS token is no longer placed at the beginning of the sequence.

See here: https://github.com/huggingface/datatrove/blob/main/src/datatrove/utils/tokenization.py#L55

This can be reproduced by tokenizing with a tokenizer that normally prepends BOS, such as Llama 3's, and inspecting the raw token IDs.
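A minimal sketch of the mechanism, without downloading Llama 3: it builds a toy word-level tokenizer with the `tokenizers` library whose default post-processor prepends BOS, then swaps in an EOS-only template. The vocabulary, token names, and the `make_tokenizer` helper are made up for illustration, and the EOS-only template is my assumption of what the linked code does, not a copy of it.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# Toy vocabulary (hypothetical, for illustration only).
vocab = {"<bos>": 0, "<eos>": 1, "<unk>": 2, "hello": 3, "world": 4}
BOS_ID, EOS_ID = vocab["<bos>"], vocab["<eos>"]


def make_tokenizer() -> Tokenizer:
    """Build a tokenizer whose default post-processor prepends BOS."""
    tok = Tokenizer(WordLevel(vocab, unk_token="<unk>"))
    tok.pre_tokenizer = Whitespace()
    tok.post_processor = TemplateProcessing(
        single="<bos> $A",
        special_tokens=[("<bos>", BOS_ID)],
    )
    return tok


# 1. Default behaviour: BOS is prepended.
tok = make_tokenizer()
default_ids = tok.encode("hello world").ids

# 2. Replacing the post-processor with an EOS-only template
#    (assumed to mirror what happens when eos_token is set): BOS is lost.
tok.post_processor = TemplateProcessing(
    single="$A <eos>",
    special_tokens=[("<eos>", EOS_ID)],
)
eos_only_ids = tok.encode("hello world").ids

# 3. One possible fix: a template that carries both BOS and EOS.
tok.post_processor = TemplateProcessing(
    single="<bos> $A <eos>",
    special_tokens=[("<bos>", BOS_ID), ("<eos>", EOS_ID)],
)
both_ids = tok.encode("hello world").ids

print(default_ids)   # [0, 3, 4]
print(eos_only_ids)  # [3, 4, 1]
print(both_ids)      # [0, 3, 4, 1]
```

With a real tokenizer the same check applies: compare the first ID of the encoded output against the tokenizer's BOS token ID before and after setting `eos_token`.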

