Generation of own dataset with Dolma Tokenizer CLI #225

@WenJett

Hi,

I appreciate the work you have done so far.

With the new release of OLMo 2, the tokenizer used appears to be allenai_dolma2.json, but in prepare_memmap_dataset.py the tokenizer is allenai/eleuther-ai-gpt-neox-20b-pii-special.

I understand that the above Python script has been deprecated, so I have also tried the Dolma tokenizer CLI with the example below:

dolma tokens --documents ./data.json.gz --destination ./ --tokenizer.name_or_path allenai/dolma2-tokenizer --tokenizer.eos_token_id 100257 --tokenizer.pad_token_id 100277 --dtype uint32
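For reference, the input to dolma tokens is gzipped JSON Lines, one JSON object per line with at least a text field; the Dolma document format also defines an id field, which I am assuming the tokenizer uses for its metadata output. A minimal sketch of producing such a file (document contents are illustrative):

```python
import gzip
import json

# Illustrative documents in the gzipped JSON-Lines layout the Dolma CLI
# consumes. "text" holds the content to be tokenized; "id" follows the
# Dolma document format (assumption: used for the tokenizer's metadata).
docs = [
    {"id": "doc-0", "text": "First training document."},
    {"id": "doc-1", "text": "Second training document."},
]

with gzip.open("data.json.gz", "wt", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")
```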

A .npy file is generated, but when I train with official-1124/OLMo2-7B-stage2-seed42.yaml after modifying the data paths at the bottom to point to it, I get an "unable to mmap an empty file" error.
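One way to check whether the output actually contains tokens before training is to open it directly; this is a sketch assuming the file is a raw stream of uint32 token IDs, which is how OLMo's memmap dataset reads it (the file name below is illustrative, not the CLI's actual naming scheme):

```python
import numpy as np

path = "./part-0.npy"  # illustrative; substitute the file the CLI produced

# np.memmap on a zero-byte file fails with an empty-file mmap error,
# so an empty output here would reproduce the failure seen in training.
tokens = np.memmap(path, dtype=np.uint32, mode="r")
print(f"{tokens.size} tokens; first few: {tokens[:8].tolist()}")

# Count EOS markers as a rough sanity check on document boundaries
# (100257 was passed as --tokenizer.eos_token_id above).
print("EOS count:", int((tokens == 100257).sum()))
```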

Hence, I was wondering:

  1. Is the correct tokenizer allenai/dolma2-tokenizer or allenai/dolma2-tokenizer-sigdig?
  2. Are there any other flags I should include in the CLI command?
  3. Is a 'text' field the bare minimum requirement for the records in data.json.gz? My data does contain one.

I hope you can provide some guidance on this matter.

Thank you.
