This repository was archived by the owner on Nov 22, 2022. It is now read-only.
load dataset shard for training #203
Closed
chenyangyu1988 wants to merge 1 commit into facebookresearch:master from chenyangyu1988:export-D13644994
Conversation
chenyangyu1988 force-pushed from c8bdc7d to 496d7a6
chenyangyu1988 added a commit to chenyangyu1988/pytext that referenced this pull request on Jan 16, 2019
Summary:
Pull Request resolved: facebookresearch#203

Main changes:
1. read_from_file and hive_reader accept rank and world_size as input parameters and load only the shard of the data required by the node.
2. We take the shard based on rank plus padding, so we don't need to know the dataset size ahead of time:
```
offset = rank * (datasize // world_size) + min(rank, datasize % world_size)
len = datasize // world_size + (1 if rank < datasize % world_size else 0)
```

The intention of this diff is to reduce memory usage when each node loads its shard of the dataset. The current implementation has every node load the whole dataset into memory and then take its shard, which can cause OOM issues because memory usage scales as num_gpus * dataset_size.

This diff ensures that:
1. Each node loads only its shard of the dataset into memory, so total memory usage is approximately the same for multi-GPU and single-GPU training.
2. The shard range is computed as offset = rank * (datasize // world_size) + min(rank, datasize % world_size) and shard_len = datasize // world_size + (1 if rank < datasize % world_size else 0), and some shards may need to be padded with one extra example.

Example:
dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
world_size = 3
shard_1 = [1, 2, 3, 4]
shard_2 = [5, 6, 7, 7]
shard_3 = [8, 9, 10, 10]

The benefits of this sharding + padding approach are:
1. It doesn't require knowing the total dataset size in advance.
2. The padding guarantees that each shard has the same number of examples, so we don't need to handle shards producing different numbers of batches.
3. For any single shard, at most one example is padded, which is negligible when the dataset is large.

Be aware that the current hiveio API is not streamed, so the hive reader can still run into OOM issues even when the sharded dataset would fit in memory.

Differential Revision: D13644994
fbshipit-source-id: 1d84eefa78c13c9867ef38a378855ffde2295795
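For reference, here is a minimal Python sketch of the offset/length formula and one-example padding described in the commit message above. The function name `shard_for_rank` is illustrative only, not the actual read_from_file / hive_reader API in PyText.

```python
def shard_for_rank(dataset, rank, world_size):
    """Return the contiguous shard for `rank`, padded so all shards have equal length."""
    datasize = len(dataset)
    # Contiguous range for this rank, distributing the remainder to the lowest ranks.
    offset = rank * (datasize // world_size) + min(rank, datasize % world_size)
    length = datasize // world_size + (1 if rank < datasize % world_size else 0)
    shard = list(dataset[offset:offset + length])
    # Ranks that did not get a remainder example are one short; repeat their last
    # example (at most one pad) so every rank sees the same shard length.
    max_len = datasize // world_size + (1 if datasize % world_size else 0)
    while shard and len(shard) < max_len:
        shard.append(shard[-1])
    return shard


dataset = list(range(1, 11))  # [1, 2, ..., 10]
print([shard_for_rank(dataset, rank, 3) for rank in range(3)])
# [[1, 2, 3, 4], [5, 6, 7, 7], [8, 9, 10, 10]]
```

This reproduces the shard_1 / shard_2 / shard_3 split shown in the commit message.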
chenyangyu1988 force-pushed from 496d7a6 to 902cf97
chenyangyu1988 added a commit to chenyangyu1988/pytext that referenced this pull request on Jan 16, 2019
Summary:
Pull Request resolved: facebookresearch#203

Main changes:
1. read_from_file and hive_reader accept rank and world_size as input parameters and load only the shard of the data required by the node.
2. We take the shard based on rank plus padding, so we don't need to know the dataset size ahead of time:
```
offset = rank * (datasize // world_size) + min(rank, datasize % world_size)
len = datasize // world_size + (1 if rank < datasize % world_size else 0)
```

The intention of this diff is to reduce memory usage when each node loads its shard of the dataset. The current implementation has every node load the whole dataset into memory and then take its shard, which can cause OOM issues because memory usage scales as num_gpus * dataset_size.

This diff ensures that:
1. Each node loads only its shard of the dataset into memory, so total memory usage is approximately the same for multi-GPU and single-GPU training.
2. The shard range is computed as offset = rank * (datasize // world_size) + min(rank, datasize % world_size) and shard_len = datasize // world_size + (1 if rank < datasize % world_size else 0), and some shards may need to be padded with one extra example.

Example:
dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
world_size = 3
shard_1 = [1, 2, 3, 4]
shard_2 = [5, 6, 7, 7]
shard_3 = [8, 9, 10, 10]

The benefits of this sharding + padding approach are:
1. It doesn't require knowing the total dataset size in advance.
2. The padding guarantees that each shard has the same number of examples, so we don't need to handle shards producing different numbers of batches.
3. For any single shard, at most one example is padded, which is negligible when the dataset is large.

Be aware that the current hiveio API is not streamed, so the hive reader can still run into OOM issues even when the sharded dataset would fit in memory.

Differential Revision: D13644994
fbshipit-source-id: c710f50398ad6fdab059ec1604be047f449c3981
chenyangyu1988 force-pushed from 902cf97 to 06c59b5
chenyangyu1988 added a commit to chenyangyu1988/pytext that referenced this pull request on Jan 17, 2019
Summary:
Pull Request resolved: facebookresearch#203

Main changes:
1. read_from_file and hive_reader accept rank and world_size as input parameters and load only the shard of the data required by the node.
2. We take the shard based on rank plus padding, so we don't need to know the dataset size ahead of time:
```
offset = rank * (datasize // world_size) + min(rank, datasize % world_size)
len = datasize // world_size + (1 if rank < datasize % world_size else 0)
```

The intention of this diff is to reduce memory usage when each node loads its shard of the dataset. The current implementation has every node load the whole dataset into memory and then take its shard, which can cause OOM issues because memory usage scales as num_gpus * dataset_size.

This diff ensures that:
1. Each node loads only its shard of the dataset into memory, so total memory usage is approximately the same for multi-GPU and single-GPU training.
2. The shard range is computed as offset = rank * (datasize // world_size) + min(rank, datasize % world_size) and shard_len = datasize // world_size + (1 if rank < datasize % world_size else 0), and some shards may need to be padded with one extra example.

Example:
dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
world_size = 3
shard_1 = [1, 2, 3, 4]
shard_2 = [5, 6, 7, 7]
shard_3 = [8, 9, 10, 10]

The benefits of this sharding + padding approach are:
1. It doesn't require knowing the total dataset size in advance.
2. The padding guarantees that each shard has the same number of examples, so we don't need to handle shards producing different numbers of batches.
3. For any single shard, at most one example is padded, which is negligible when the dataset is large.

Be aware that the current hiveio API is not streamed, so the hive reader can still run into OOM issues even when the sharded dataset would fit in memory.

Reviewed By: ahhegazy

Differential Revision: D13644994
fbshipit-source-id: 60b2b1dd25edca0071c9d88d6d6929d117e78d40
Summary:
Pull Request resolved: facebookresearch#203

Main changes:
1. read_from_file and hive_reader accept rank and world_size as input parameters and load only the shard of the data required by the node.
2. We take the shard based on rank plus padding, so we don't need to know the dataset size ahead of time:
```
offset = rank * (datasize // world_size) + min(rank, datasize % world_size)
len = datasize // world_size + (1 if rank < datasize % world_size else 0)
```

The intention of this diff is to reduce memory usage when each node loads its shard of the dataset. The current implementation has every node load the whole dataset into memory and then take its shard, which can cause OOM issues because memory usage scales as num_gpus * dataset_size.

This diff ensures that:
1. Each node loads only its shard of the dataset into memory, so total memory usage is approximately the same for multi-GPU and single-GPU training.
2. The shard range is computed as offset = rank * (datasize // world_size) + min(rank, datasize % world_size) and shard_len = datasize // world_size + (1 if rank < datasize % world_size else 0), and some shards may need to be padded with one extra example.

Example:
dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
world_size = 3
shard_1 = [1, 2, 3, 4]
shard_2 = [5, 6, 7, 7]
shard_3 = [8, 9, 10, 10]

The benefits of this sharding + padding approach are:
1. It doesn't require knowing the total dataset size in advance.
2. The padding guarantees that each shard has the same number of examples, so we don't need to handle shards producing different numbers of batches.
3. For any single shard, at most one example is padded, which is negligible when the dataset is large.

Be aware that the current hiveio API is not streamed, so the hive reader can still run into OOM issues even when the sharded dataset would fit in memory.

Reviewed By: ahhegazy

Differential Revision: D13644994
fbshipit-source-id: a92c3e25e2758cd961bb2361f909e41b9fe8ecd2
chenyangyu1988 force-pushed from 06c59b5 to f894dcf
Summary:
The intention of this diff is to reduce memory usage when each node loads its shard of the dataset.
The current implementation has every node load the whole dataset into memory and then take its shard, which can cause OOM issues because memory usage scales as num_gpus * dataset_size.
This diff ensures that:
1. Each node loads only its shard of the dataset into memory, so total memory usage is approximately the same for multi-GPU and single-GPU training.
2. We take the shard as [rank, rank + world_size * 1, rank + world_size * 2, ...], and some shards may need to be padded with one extra example, as in the example below.
Example
dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
world_size = 3
shard_1 = [1, 4, 7, 10]
shard_2 = [2, 5, 8, 8]
shard_3 = [3, 6, 9, 9]
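As a hedged illustration of this strided sharding + padding scheme (the names below are hypothetical, not PyText's actual API), a shard for a given rank could be built like this:

```python
def strided_shard(dataset, rank, world_size):
    """Take every world_size-th example starting at `rank`, then pad to equal length."""
    shard = list(dataset[rank::world_size])
    # Ceiling division gives the longest shard's length; shorter shards get at most
    # one repeated example appended so every rank sees the same number of examples.
    max_len = -(-len(dataset) // world_size)
    while shard and len(shard) < max_len:
        shard.append(shard[-1])
    return shard


dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print([strided_shard(dataset, rank, 3) for rank in range(3)])
# [[1, 4, 7, 10], [2, 5, 8, 8], [3, 6, 9, 9]]
```

This reproduces the shards shown above; in a streaming reader the same effect can be had by keeping every world_size-th record as it is read, without knowing the total size in advance.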
The benefits of this sharding + padding approach are:
1. It doesn't require knowing the total dataset size in advance.
2. The padding guarantees that each shard has the same number of examples, so we don't need to handle shards producing different numbers of batches.
3. For any single shard, at most one example is padded, which is negligible when the dataset is large.
Be aware that the current hiveio API is not streamed, so the hive reader can still run into OOM issues even when the sharded dataset would fit in memory.
Differential Revision: D13644994