
load dataset shard for training #203

Conversation

chenyangyu1988 (Contributor)

Summary:
The intention of this diff is to reduce memory usage when each node loads its shard of the dataset.

The current implementation is that every node loads the whole dataset into memory and then takes its shard, which can cause OOM issues because total memory usage grows as num_gpus * dataset_size.

This diff changes this so that:

  1. each node loads only its own shard into memory, so total memory usage is approximately the same for multi-GPU and single-GPU training
  2. we take the shard at indices [rank, rank + world_size * 1, rank + world_size * 2, ...], and we may need to pad one extra example in some shards (see the example and sketch below)

Example
dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
world_size = 3

shard_1 = [1, 4, 7, 10]
shard_2 = [2, 5, 8, 8]
shard_3 = [3, 6, 9, 9]
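
A minimal sketch of this strided shard-and-pad scheme (illustrative only; shard_and_pad is a hypothetical helper, not the PR's actual API). Because it only enumerates the input stream, it never needs the dataset size in advance:

```python
def shard_and_pad(stream, rank, world_size):
    # Keep the examples at positions rank, rank + world_size, rank + 2 * world_size, ...
    shard, total = [], 0
    for i, example in enumerate(stream):
        total = i + 1
        if i % world_size == rank:
            shard.append(example)
    # The longest shard has ceil(total / world_size) examples; repeat the last
    # example (at most once) so every shard ends up the same length.
    target = -(-total // world_size)
    if shard and len(shard) < target:
        shard.append(shard[-1])
    return shard

dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for rank in range(3):
    print(shard_and_pad(dataset, rank, world_size=3))
# [1, 4, 7, 10]
# [2, 5, 8, 8]
# [3, 6, 9, 9]
```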

The benefits of this sharding + padding approach are:

  1. It doesn't require us to know the total dataset size in advance
  2. The padding guarantees that each shard has the same number of examples, so we don't need to handle mismatched batch counts across nodes
  3. For every shard, the maximum padding is 1 example, which is negligible when the dataset is large

Note that the current hiveio API is not streamed, so the hive reader can still hit OOM even when the sharded dataset would fit in memory.

Differential Revision: D13644994

facebook-github-bot added the CLA Signed label Jan 12, 2019
chenyangyu1988 added a commit to chenyangyu1988/pytext that referenced this pull request Jan 16, 2019
Summary:
Pull Request resolved: facebookresearch#203

```
Main changes
1. read_from_file and hive_reader accept rank and world_size as input parameters and load only the shard of data required by the node
2. we take the shard based on rank + padding, so we don't need to know the dataset size ahead of time

offset = rank * (datasize // world_size) + min(rank, datasize % world_size)
len = datasize // world_size + (1 if rank < datasize % world_size else 0)
```

The intention of this diff is to reduce memory usage when each node loads its shard of the dataset.

The current implementation is that every node loads the whole dataset into memory and then takes its shard, which can cause OOM issues because total memory usage grows as num_gpus * dataset_size.

This diff changes this so that:
1. each node loads only its own shard into memory, so total memory usage is approximately the same for multi-GPU and single-GPU training
2. we take the shard range using offset = rank * (datasize // world_size) + min(rank, datasize % world_size) and shard_len = datasize // world_size + (1 if rank < datasize % world_size else 0), and we may need to pad one extra example in some shards (see the example and sketch below)

Example
dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
world_size = 3

shard_1 = [1, 2, 3, 4]
shard_2 = [5, 6, 7, 7]
shard_3 = [8, 9, 10, 10]
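
A minimal sketch of the offset/length formula above (illustrative only; shard_range and load_shard are hypothetical names, not the PR's actual functions):

```python
def shard_range(datasize, rank, world_size):
    # The first (datasize % world_size) ranks each take one extra example.
    offset = rank * (datasize // world_size) + min(rank, datasize % world_size)
    shard_len = datasize // world_size + (1 if rank < datasize % world_size else 0)
    return offset, shard_len

def load_shard(dataset, rank, world_size):
    offset, shard_len = shard_range(len(dataset), rank, world_size)
    shard = dataset[offset:offset + shard_len]
    # Pad (by repeating the last example at most once) up to
    # ceil(datasize / world_size) so every rank sees the same number of batches.
    target = -(-len(dataset) // world_size)
    if shard and len(shard) < target:
        shard.append(shard[-1])
    return shard

dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for rank in range(3):
    print(load_shard(dataset, rank, world_size=3))
# [1, 2, 3, 4]
# [5, 6, 7, 7]
# [8, 9, 10, 10]
```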

The benefits of this sharding + padding approach are:
1. It doesn't require us to know the total dataset size in advance
2. The padding guarantees that each shard has the same number of examples, so we don't need to handle mismatched batch counts across nodes
3. For every shard, the maximum padding is 1 example, which is negligible when the dataset is large

Note that the current hiveio API is not streamed, so the hive reader can still hit OOM even when the sharded dataset would fit in memory.

Differential Revision: D13644994

fbshipit-source-id: 1d84eefa78c13c9867ef38a378855ffde2295795
chenyangyu1988 added a commit to chenyangyu1988/pytext that referenced this pull request Jan 16, 2019
chenyangyu1988 added a commit to chenyangyu1988/pytext that referenced this pull request Jan 17, 2019
chenyangyu1988 deleted the export-D13644994 branch May 15, 2019 17:41