
Possible bug in local_shuffle? #139

Closed
@hwijeen

Description


Hi, thanks for the great library! It's good to see such a library written in Python, and it's a great resource for learning about the data side of LLM pretraining.

I was looking at the part where data is shuffled, and saw that `local_shuffle` does not work as I expected. I expected each process to gather `local_shuffle` tokenized documents (each line in a json.gz file) from the source paths (json.gz files), shuffle those, and then write them out via mmap, roughly as in the sketch below.
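
To make the expectation concrete, here is a minimal sketch of the buffered behavior I had in mind. This is not the library's actual code: `tokenize` and the `memmap_writer` write API are hypothetical stand-ins.

```python
import gzip
import json
import random

def tokenize(doc):
    """Hypothetical stand-in for the library's tokenization step."""
    ...

def expected_local_shuffle(source_paths, memmap_writer, local_shuffle, seed=0):
    """Buffer up to `local_shuffle` documents, shuffle the buffer,
    then flush it to the mmap output in one pass."""
    rng = random.Random(seed)
    buffer = []
    for path in source_paths:
        with gzip.open(path, "rt") as f:
            for line in f:
                buffer.append(tokenize(json.loads(line)))
                if len(buffer) >= local_shuffle:
                    rng.shuffle(buffer)
                    for doc in buffer:
                        memmap_writer.write(doc)  # hypothetical writer API
                    buffer.clear()
    if buffer:  # flush the final partial buffer
        rng.shuffle(buffer)
        for doc in buffer:
            memmap_writer.write(doc)
```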

But it seems that the code shuffles and writes for each document, instead of once per `local_shuffle` documents. I think this makes local shuffling a no-op (the buffer never holds more than one document) and also results in more frequent writes, which may have performance implications. I think something like de-indenting lines 121 and below could be a fix?
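
Schematically, the current control flow looks to me like the following (reusing the hypothetical names from the sketch above); the proposed fix is essentially to de-indent the shuffle and write out of the inner loop:

```python
def current_behavior(source_paths, memmap_writer, local_shuffle, seed=0):
    """How the code currently reads to me: the shuffle and write run
    inside the per-document loop, so the buffer is always length 1."""
    rng = random.Random(seed)
    for path in source_paths:
        with gzip.open(path, "rt") as f:
            for line in f:
                buffer = [tokenize(json.loads(line))]
                rng.shuffle(buffer)           # shuffling one element: a no-op
                for doc in buffer:
                    memmap_writer.write(doc)  # one mmap write per document
```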

Is this a bug, or am I misunderstanding something? Thank you!
I am tagging @soldni, who wrote this file :)
