
Possible bug in local_shuffle? #139

Closed
@hwijeen

Description


Hi, thanks for the great library! It's good to see such a library written in Python, and it's a great resource for learning about the data side of LLM pretraining.

I was looking at the part where data is shuffled, and saw that `local_shuffle` does not work as I expected. I expected each process to gather `local_shuffle` tokenized documents (each line in a json.gz file) from the source paths (json.gz files), shuffle those, and then write them out via mmap, roughly as in the sketch below.
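
To make the expectation concrete, here is a minimal sketch of the buffered behavior I had in mind. This is not the library's actual code: `tokenize` and the `memmap_writer` write API are hypothetical stand-ins.

```python
import gzip
import json
import random

def tokenize(doc):
    """Hypothetical stand-in for the library's tokenization step."""
    ...

def expected_local_shuffle(source_paths, memmap_writer, local_shuffle, seed=0):
    """Buffer up to `local_shuffle` documents, shuffle the buffer,
    then flush it to the mmap output in one pass."""
    rng = random.Random(seed)
    buffer = []
    for path in source_paths:
        with gzip.open(path, "rt") as f:
            for line in f:
                buffer.append(tokenize(json.loads(line)))
                if len(buffer) >= local_shuffle:
                    rng.shuffle(buffer)
                    for doc in buffer:
                        memmap_writer.write(doc)  # hypothetical writer API
                    buffer.clear()
    if buffer:  # flush the final partial buffer
        rng.shuffle(buffer)
        for doc in buffer:
            memmap_writer.write(doc)
```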

But it seems that the code shuffles and writes for each document, instead of once per `local_shuffle` documents. I think this makes local shuffling a no-op (the buffer never holds more than one document) and also results in more frequent writes, which may have performance implications. I think something like de-indenting lines 121 and below could be a fix?
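
Schematically, the current control flow looks to me like the following (reusing the hypothetical names from the sketch above); the proposed fix is essentially to de-indent the shuffle and write out of the inner loop:

```python
def current_behavior(source_paths, memmap_writer, local_shuffle, seed=0):
    """How the code currently reads to me: the shuffle and write run
    inside the per-document loop, so the buffer is always length 1."""
    rng = random.Random(seed)
    for path in source_paths:
        with gzip.open(path, "rt") as f:
            for line in f:
                buffer = [tokenize(json.loads(line))]
                rng.shuffle(buffer)           # shuffling one element: a no-op
                for doc in buffer:
                    memmap_writer.write(doc)  # one mmap write per document
```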

Is this a bug, or am I misunderstanding something? Thank you!
I am tagging @soldni, who wrote this file :)
