
Batch Viewer : Why Sequence Length 2049? #123

Closed
@prakharg24

Description


Hi,
I am using utils/batch_viewer.py to iterate through Pythia's training data and calculate some batch-level statistics.
Firstly, there are some gaps between the actual code in batch_viewer.py and the usage described in the README (for example, it doesn't take a 'config file' as input, and the 'load file' name needs to be supplied separately). These differences were obvious enough that I could fix them on my end and run the code.

However, it's the final step of saving the data after loading the buffer that I'm a bit confused about. I have two questions:

  1. Given that each 'sequence' in the dataset has a different length, can someone confirm that training is performed by simply concatenating the whole dataset into a single stream of tokens and then dividing it into fixed-length 'sentences' and batches (see the toy sketch after this list for what I mean)? This would mean that some dataset 'sequences' are split across different sentences or batches, and a single 'sentence' of 2048 tokens might contain several actual dataset sequences. I believe this is how most LLMs are trained, but I couldn't find the exact details in the paper.
  2. The MMapDataset class reshapes the final concatenated token stream to (-1, 2049). I don't understand why 2049. Isn't the sentence length supposed to be 2048? I'm new to the specifics of how LLMs are trained, so I may be missing something trivial, but I don't see where the extra token comes from.
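To make my mental model concrete, here is a toy sketch of what I *think* is happening. The variable names, the toy row length of 5 (standing in for 2049), and the input/target split at the end are my own assumptions for illustration, not the actual batch_viewer.py code, so please correct me if this is wrong:

```python
import numpy as np

# Pretend these are three tokenized dataset documents of different lengths.
docs = [
    np.array([10, 11, 12, 13, 14, 15, 16], dtype=np.uint16),
    np.array([20, 21, 22], dtype=np.uint16),
    np.array([30, 31, 32, 33, 34], dtype=np.uint16),
]

ROW_LEN = 5  # 2049 in the real code; 5 here so the toy example stays small

# Step 1: concatenate everything into one long token stream,
# ignoring document boundaries.
stream = np.concatenate(docs)

# Step 2: drop the remainder and reshape into fixed-length rows,
# which is how I read the reshape(-1, 2049) in MMapDataset.
n_rows = len(stream) // ROW_LEN
rows = stream[: n_rows * ROW_LEN].reshape(n_rows, ROW_LEN)

# My guess (unconfirmed): each 2049-token row is split into 2048 input
# tokens and 2048 shifted target tokens for next-token prediction,
# which would explain the extra token.
inputs, targets = rows[:, :-1], rows[:, 1:]
print(rows)
print(inputs.shape, targets.shape)
```

If the real pipeline works differently (e.g. if document boundaries are handled specially), I'd appreciate a pointer to where that happens.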
