Hi,
I am using utils/batch_viewer.py to iterate through Pythia's training data and calculate some batch-level statistics.
Firstly, there are some gaps between the actual code in batch_viewer.py and what the README describes (for example, it doesn't take a 'config file' as input, and the 'load file' name needs to be supplied separately). But these differences were obvious enough that I could fix them on my end and run the code.
However, it's the final step of saving the data after loading the buffer that I'm a bit confused about. I have two questions:
- Given that each 'sequence' in the dataset has a different length, can someone confirm that training is performed by simply concatenating the whole dataset into a single stream of tokens, and then dividing that stream into sentences and batches? This would mean that some 'sequences' are broken across different sentences or batches, and even one 'sentence' of 2048 tokens might contain multiple actual dataset sequences. I believe this is how most LLMs are trained, but I couldn't find the exact details in the paper. (I've sketched my understanding below.)
- The MMapDataset function attempts to reshape the final concatenated sequence into (-1, 2049). I don't understand why 2049. Isn't the sentence length supposed to be 2048? I'm new to the specifics of how LLMs are trained, so I may be missing some trivial detail here, but I can't see where the extra token comes from. (My guess is in the second sketch below.)
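
To make the first question concrete, here is a minimal toy sketch of what I *think* the packing step does (this is my assumption, not code from the repo; the real pipeline reads tokenized documents from the preprocessed .bin/.idx files, and the real row length would be 2049 rather than the 8 I use here so the toy data produces rows):

```python
import numpy as np

# Toy stand-in for the tokenized dataset: documents of varying length.
# (In the real pipeline these come from the preprocessed .bin/.idx files.)
docs = [np.arange(10), np.arange(25), np.arange(7), np.arange(40)]

SEQ_LEN = 8  # assumption: the real value would be 2049; 8 keeps the toy data visible

# Concatenate all documents into one long token stream, ignoring boundaries...
stream = np.concatenate(docs)

# ...then chop the stream into fixed-length rows. Document boundaries can fall
# anywhere inside a row, and one row can span several documents.
n_rows = len(stream) // SEQ_LEN
packed = stream[: n_rows * SEQ_LEN].reshape(n_rows, SEQ_LEN)

print(packed.shape)  # (10, 8) for this toy data
```

Is this the right mental model?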
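
For the second question, my current guess (again an assumption on my part, based on how GPT-style next-token prediction is usually set up) is that each stored row is one token longer than the context length so that a single row can supply both the 2048-token input and the 2048-token target shifted by one position:

```python
import numpy as np

# One packed row of length 2049 (toy random tokens from a made-up vocabulary)
row = np.random.randint(0, 1000, size=2049)

# Next-token prediction: the input is the first 2048 tokens, and the label for
# position i is the token at position i + 1, i.e. the last 2048 tokens.
inputs = row[:-1]  # positions 0 .. 2047, length 2048
labels = row[1:]   # positions 1 .. 2048, length 2048

assert len(inputs) == len(labels) == 2048
```

If that's the reason for 2049, it would be great if someone could confirm it.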