Data out of bounds when using ‘dolma tokens --dtype uint32’ #142

Open
@Jackwaterveg

Description

[screenshot: token IDs read back from the output file exceed the tokenizer's vocabulary size]

After running the command

dolma tokens \
    --documents "dataset/${data_source}_add_id" \
    --tokenizer.name_or_path Qwen/Qwen1.5-7B-Chat \
    --destination dataset/${data_source}_npy \
    --tokenizer.eos_token_id 151643 \
    --tokenizer.pad_token_id 151646 \
    --dtype "uint32" \
    --processes 20

I then read the memmap file with the code below. As shown above, the data is out of bounds: the vocabulary size is only about 150000, but the file contains token IDs beyond that range.

    from olmo.data.memmap_dataset import MemMapDataset  # import path assumed (OLMo codebase)

    data = MemMapDataset(filePath, chunk_size=2048, memmap_dtype="uint32")
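A quick way to confirm the overflow independently of MemMapDataset is to memory-map the file directly with NumPy and compare the maximum token ID against the tokenizer's vocabulary size. This is a minimal sketch: the file path is hypothetical, and it assumes the output is a flat array of raw uint32 token IDs, matching the --dtype "uint32" passed to dolma tokens.

    import numpy as np
    from transformers import AutoTokenizer

    # Hypothetical path to one of the files produced by `dolma tokens`.
    file_path = "dataset/example_npy/part-0.npy"

    # Assumes a flat array of raw uint32 token IDs (per --dtype "uint32").
    tokens = np.memmap(file_path, dtype=np.uint32, mode="r")

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B-Chat")
    vocab_size = len(tokenizer)  # includes added special tokens

    print(f"max token id: {tokens.max()}, vocab size: {vocab_size}")
    print(f"out-of-bounds tokens: {(tokens >= vocab_size).sum()} / {tokens.size}")

If the maximum token ID already exceeds the vocabulary size in the raw file, the corruption happens at write time rather than in the reading code.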
