
Huge amount of CPU RAM needed during training #574

Closed
@davidecaroselli


Hello team!

With the current version of fairseq, we noticed that a huge amount of RAM (CPU RAM, not GPU RAM) is required to run training. Moreover, this amount is correlated with the number of GPUs used on the same machine.

So my guess is that the binarized training data is loaded entirely into RAM by every GPU process, which means the total CPU RAM usage is roughly:
RAM ~= (number of GPUs) * sizeof(binarized data).
If this is true, the amount of RAM needed for medium/large training sets is huge (hundreds of GB), even though the binarized training data itself is less than 100 GB.
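Just to make the estimate concrete, here is a back-of-the-envelope version of that formula plus a small helper to measure actual per-process usage. The numbers are hypothetical and psutil is an extra dependency, not part of fairseq:

```python
# Back-of-the-envelope check of the scaling above (hypothetical numbers),
# plus a helper to measure the actual resident memory of one process.
import psutil  # extra dependency: pip install psutil

NUM_GPUS = 4          # hypothetical machine
BINARIZED_GB = 60.0   # hypothetical size of the binarized training data

print(f"expected total CPU RAM ~= {NUM_GPUS * BINARIZED_GB:.0f} GB")

def resident_gb():
    # Resident set size of the current (per-GPU) training process, in GB.
    return psutil.Process().memory_info().rss / (1024 ** 3)
```

Summing the resident size over the per-GPU training processes is roughly where the hundreds of GB come from.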

If this is the case, why can't we use a memory-mapped training set, so that the total amount of RAM depends only on sizeof(binarized data), regardless of the number of GPUs?
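Something along these lines is what I have in mind, purely as an illustration: the file layout here (a flat .bin of int32 tokens plus an .idx array of (offset, length) pairs) is made up for this sketch and is not fairseq's actual on-disk format:

```python
# Illustrative sketch of a memory-mapped dataset; the .bin/.idx layout
# below is invented for this example, not fairseq's indexed dataset format.
import numpy as np
import torch
from torch.utils.data import Dataset

class MMapTokenDataset(Dataset):
    def __init__(self, bin_path, idx_path):
        # np.memmap keeps the data on disk; pages are loaded on demand
        # and shared between processes through the OS page cache.
        self.tokens = np.memmap(bin_path, dtype=np.int32, mode="r")
        # Hypothetical index: shape (num_sentences, 2) with (offset, length).
        self.index = np.load(idx_path)

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i):
        offset, length = self.index[i]
        # Only this slice is touched, so per-process RSS stays small.
        sample = np.array(self.tokens[offset:offset + length])
        return torch.from_numpy(sample)
```

Since np.memmap only brings pages into the OS page cache when they are actually read, and that cache is shared between processes, the total RAM usage should no longer scale with the number of GPU processes.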

I'm available to work on this if needed; could you please give me some code context, or a good starting point to begin with?
