When using `DocumentTokenizer`, if an `eos_token` is specified the tokenizer's post-processor is replaced with one that appends an EOS token. However, this has the side effect of NOT placing a BOS token at the beginning of the sequence.
See here: https://github.com/huggingface/datatrove/blob/main/src/datatrove/utils/tokenization.py#L55
This can be reproduced by tokenizing with a tokenizer such as Llama 3 (whose default post-processor prepends a BOS token) and inspecting the raw token values: the output ends with EOS but no longer starts with BOS.
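A minimal sketch of the effect using the `tokenizers` library. This is an illustrative toy vocabulary, not the actual Llama 3 tokenizer, and the template below is an assumption about what an EOS-only replacement post-processor looks like; the real one lives at the link above:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, processors

# Tiny illustrative vocab with explicit BOS/EOS ids (hypothetical, for demo only)
vocab = {"<bos>": 0, "<eos>": 1, "hello": 2, "world": 3, "<unk>": 4}
tok = Tokenizer(models.WordLevel(vocab, unk_token="<unk>"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()

# An EOS-only template like this appends EOS but never prepends BOS,
# which is the behavior described in this issue.
tok.post_processor = processors.TemplateProcessing(
    single="$A <eos>",
    special_tokens=[("<eos>", 1)],
)
ids = tok.encode("hello world").ids
print(ids)  # [2, 3, 1] — EOS (1) appended, but no BOS (0) at position 0

# A template that keeps both special tokens would instead read:
tok.post_processor = processors.TemplateProcessing(
    single="<bos> $A <eos>",
    special_tokens=[("<bos>", 0), ("<eos>", 1)],
)
print(tok.encode("hello world").ids)  # [0, 2, 3, 1]
```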