Data out of bounds when using ‘dolma tokens --dtype uint32’ #142

Open
@Jackwaterveg

Description

[screenshot: token IDs read back from the output file exceed the tokenizer's vocabulary size]

After running the command

dolma tokens \
    --documents "dataset/${data_source}_add_id" \
    --tokenizer.name_or_path Qwen/Qwen1.5-7B-Chat \
    --destination dataset/${data_source}_npy \
    --tokenizer.eos_token_id 151643 \
    --tokenizer.pad_token_id 151646 \
    --dtype "uint32" \
    --processes 20

I then read the memmap file with the code below. As shown above, the data is out of bounds: the vocabulary size is only about 150000, but the file contains token IDs beyond that range.

    from olmo.data.memmap_dataset import MemMapDataset  # import path assumed (OLMo codebase)

    data = MemMapDataset(filePath, chunk_size=2048, memmap_dtype="uint32")
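A quick way to confirm the overflow independently of MemMapDataset is to memory-map the file directly with NumPy and compare the maximum token ID against the tokenizer's vocabulary size. This is a minimal sketch: the file path is hypothetical, and it assumes the output is a flat array of raw uint32 token IDs, matching the --dtype "uint32" passed to dolma tokens.

    import numpy as np
    from transformers import AutoTokenizer

    # Hypothetical path to one of the files produced by `dolma tokens`.
    file_path = "dataset/example_npy/part-0.npy"

    # Assumes a flat array of raw uint32 token IDs (per --dtype "uint32").
    tokens = np.memmap(file_path, dtype=np.uint32, mode="r")

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B-Chat")
    vocab_size = len(tokenizer)  # includes added special tokens

    print(f"max token id: {tokens.max()}, vocab size: {vocab_size}")
    print(f"out-of-bounds tokens: {(tokens >= vocab_size).sum()} / {tokens.size}")

If the maximum token ID already exceeds the vocabulary size in the raw file, the corruption happens at write time rather than in the reading code.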
