Description
Hi,
I appreciate the work done so far.
With the new release of OLMo 2, the tokenizer used appears to be allenai_dolma2.json, but in prepare_memmap_dataset.py the tokenizer is allenai/eleuther-ai-gpt-neox-20b-pii-special.
I understand that the above Python script has been deprecated, so I have also tried the Dolma tokenizer CLI, as in the example below.
dolma tokens --documents ./data.json.gz --destination ./ --tokenizer.name_or_path allenai/dolma2-tokenizer --tokenizer.eos_token_id 100257 --tokenizer.pad_token_id 100277 --dtype uint32
Although a .npy file is generated, when I point the data paths at the bottom of official-1124/OLMo2-7B-stage2-seed42.yaml to the generated .npy file and run it, I get the error "unable to mmap an empty file".
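As a sanity check on my side, I also inspected the generated file with a short script. The filename below is a placeholder for the actual output, and the raw, headerless uint32 layout is my assumption based on how OLMo's memmap dataset appears to read these files:

```python
import os
import numpy as np

# Placeholder filename: substitute the .npy that `dolma tokens` produced.
path = "part-0-00000.npy"

# Write a tiny dummy token file so this sketch runs on its own;
# in practice, skip this step and point `path` at the real output.
dummy = np.memmap(path, dtype=np.uint32, mode="w+", shape=(4,))
dummy[:] = [101, 102, 103, 100257]
dummy.flush()

# My understanding is that OLMo memmaps these files as flat arrays of
# token IDs; "unable to mmap an empty file" would then mean the file
# on the configured data path has zero bytes.
print("size in bytes:", os.path.getsize(path))
tokens = np.memmap(path, dtype=np.uint32, mode="r")
print("token count:", tokens.shape[0])
print("last token (expected to be the EOS id):", tokens[-1])
```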
Hence, I was wondering:
- whether the correct tokenizer to use is allenai/dolma2-tokenizer or allenai/dolma2-tokenizer-sigdig;
- whether there are any other flags I should pass to the CLI;
- whether a 'text' field, which my data in data.json.gz does contain, is the bare minimum requirement for each document.
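For reference, a minimal sketch of the kind of records my data.json.gz contains. The 'id' and 'source' fields beyond 'text' are my assumption from reading the Dolma data format description, and the document texts are placeholders:

```python
import gzip
import json

# Hypothetical minimal records; my reading of the Dolma data format is
# that each line is a JSON object with at least "id" and "text"
# ("source" also appears in the spec, so it is included to be safe).
docs = [
    {"id": "doc-0", "text": "Hello OLMo 2!", "source": "my-dataset"},
    {"id": "doc-1", "text": "A second document.", "source": "my-dataset"},
]

# One JSON object per line, gzip-compressed (JSONL in a .json.gz file).
with gzip.open("data.json.gz", "wt", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")

# Round-trip check: every line parses and has a non-empty "text" field.
with gzip.open("data.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        assert json.loads(line)["text"]
```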
I hope you can provide some guidance for this matter.
Thank you.