OOM on preprocessing dataset with large number of documents #34

Open
RaymondLi0 opened this issue Mar 10, 2023 · 0 comments
Labels
bug Something isn't working

Comments

@RaymondLi0
Collaborator

When processing a dataset of 55 GB (31M samples), preprocessing runs out of memory on a machine with 1.5 TB of memory.

The error happens when saving the index. Other, larger datasets preprocessed without issue, but this dataset has the largest number of documents.

Traceback (most recent call last):
  File "Megatron-LM/tools/preprocess_data.py", line 227, in <module>
    main()
  File "Megatron-LM/tools/preprocess_data.py", line 224, in main
    builders[key].finalize(output_idx_files[key])
  File "/app/Megatron-LM/megatron/data/indexed_dataset.py", line 576, in finalize
    index.write(self._sizes, self._doc_idx)
  File "/app/Megatron-LM/megatron/data/indexed_dataset.py", line 369, in write
    pointers = self._get_pointers(sizes)
  File "/app/Megatron-LM/megatron/data/indexed_dataset.py", line 363, in _get_pointers
    pointers.append(address)
MemoryError
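
For context, the MemoryError is raised while the index writer accumulates one byte offset per document in a plain Python list. Below is a minimal sketch of that pattern, together with a hypothetical NumPy-based variant that keeps the offsets in a single int64 array instead of millions of Python objects (names and the exact arithmetic are paraphrased from the traceback, not copied from the Megatron-LM source):

    import numpy as np

    def get_pointers_list(sizes, itemsize):
        # Pattern implied by the traceback: append one Python int per
        # document. With ~31M documents this allocates tens of millions of
        # Python objects on top of the sizes/doc_idx lists already held.
        pointers = []
        address = 0
        for size in sizes:
            pointers.append(address)
            address += size * itemsize
        return pointers

    def get_pointers_numpy(sizes, itemsize):
        # Hypothetical alternative: compute the same byte offsets as an
        # exclusive cumulative sum over a single preallocated int64 array.
        sizes = np.asarray(sizes, dtype=np.int64)
        pointers = np.zeros_like(sizes)
        np.cumsum(sizes[:-1] * itemsize, out=pointers[1:])
        return pointers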

The workaround for now is to first shard the dataset and tokenize each shard independently, as sketched below. At training time, the shards can be blended back together.
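
A rough sketch of that workaround, assuming a JSON-lines input and the standard tools/preprocess_data.py entry point (the shard count, file names, and tokenizer arguments below are placeholders, not the exact values used):

    import subprocess

    NUM_SHARDS = 32  # placeholder: choose a count that keeps each shard's document count manageable

    # 1. Split the JSON-lines dataset into shards (round-robin by line).
    shard_files = [open(f"data_shard_{i}.jsonl", "w") for i in range(NUM_SHARDS)]
    with open("data.jsonl") as f:
        for n, line in enumerate(f):
            shard_files[n % NUM_SHARDS].write(line)
    for sf in shard_files:
        sf.close()

    # 2. Tokenize each shard independently, so no single run has to build
    #    an index over all 31M documents at once.
    for i in range(NUM_SHARDS):
        subprocess.run(
            [
                "python", "Megatron-LM/tools/preprocess_data.py",
                "--input", f"data_shard_{i}.jsonl",
                "--output-prefix", f"data_shard_{i}",
                "--tokenizer-type", "GPT2BPETokenizer",  # placeholder tokenizer settings
                "--vocab-file", "gpt2-vocab.json",
                "--merge-file", "gpt2-merges.txt",
                "--append-eod",
                "--workers", "16",
            ],
            check=True,
        )

At training time, the per-shard output prefixes can then be listed (optionally with weights) in the --data-path argument so the shards are blended back into one dataset.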

RaymondLi0 added the bug label on Mar 10, 2023