Generation of own dataset with Dolma Tokenizer CLI #225

@WenJett

Hi,

I appreciate the work you have done so far.

With the new release of OLMo 2, the tokenizer used appears to be allenai_dolma2.json, but in prepare_memmap_dataset.py the tokenizer is allenai/eleuther-ai-gpt-neox-20b-pii-special.

I understand that the above Python script has been deprecated, so I have also tried the Dolma tokenizer CLI with the example below:

dolma tokens --documents ./data.json.gz --destination ./ --tokenizer.name_or_path allenai/dolma2-tokenizer --tokenizer.eos_token_id 100257 --tokenizer.pad_token_id 100277 --dtype uint32
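For reference, the input to dolma tokens is gzipped JSON Lines, one JSON object per line with at least a text field; the Dolma document format also defines an id field, which I am assuming the tokenizer uses for its metadata output. A minimal sketch of producing such a file (document contents are illustrative):

```python
import gzip
import json

# Illustrative documents in the gzipped JSON-Lines layout the Dolma CLI
# consumes. "text" holds the content to be tokenized; "id" follows the
# Dolma document format (assumption: used for the tokenizer's metadata).
docs = [
    {"id": "doc-0", "text": "First training document."},
    {"id": "doc-1", "text": "Second training document."},
]

with gzip.open("data.json.gz", "wt", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")
```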

A .npy file is generated, but when I train with official-1124/OLMo2-7B-stage2-seed42.yaml after modifying the data paths at the bottom to point to it, I get an "unable to mmap an empty file" error.
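One way to check whether the output actually contains tokens before training is to open it directly; this is a sketch assuming the file is a raw stream of uint32 token IDs, which is how OLMo's memmap dataset reads it (the file name below is illustrative, not the CLI's actual naming scheme):

```python
import numpy as np

path = "./part-0.npy"  # illustrative; substitute the file the CLI produced

# np.memmap on a zero-byte file fails with an empty-file mmap error,
# so an empty output here would reproduce the failure seen in training.
tokens = np.memmap(path, dtype=np.uint32, mode="r")
print(f"{tokens.size} tokens; first few: {tokens[:8].tolist()}")

# Count EOS markers as a rough sanity check on document boundaries
# (100257 was passed as --tokenizer.eos_token_id above).
print("EOS count:", int((tokens == 100257).sum()))
```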

Hence, I was wondering:

  1. Is the correct tokenizer allenai/dolma2-tokenizer or allenai/dolma2-tokenizer-sigdig?
  2. Are there any other flags I should include in the CLI command?
  3. Is a 'text' field the bare minimum requirement for the records in data.json.gz? My data does contain one.

I hope you can provide some guidance on this matter.

Thank you.
