
load dataset shard for training #203

Conversation

chenyangyu1988 (Contributor)

Summary:
The intention of this diff is to reduce memory usage when each node loads its shard of the dataset.

The current implementation is that every node loads the whole dataset into memory and then takes its shard, which can cause OOM issues because total memory usage grows as num_gpus * dataset_size.

This diff changes this so that:

  1. each node loads only its own shard into memory, so total memory usage is approximately the same for multi-GPU and single-GPU training
  2. we take the shard at indices [rank, rank + world_size * 1, rank + world_size * 2, ...], and we may need to pad one extra example in some shards (see the example and sketch below)

Example
dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
world_size = 3

shard_1 = [1, 4, 7, 10]
shard_2 = [2, 5, 8, 8]
shard_3 = [3, 6, 9, 9]
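
A minimal sketch of this strided shard-and-pad scheme (illustrative only; shard_and_pad is a hypothetical helper, not the PR's actual API). Because it only enumerates the input stream, it never needs the dataset size in advance:

```python
def shard_and_pad(stream, rank, world_size):
    # Keep the examples at positions rank, rank + world_size, rank + 2 * world_size, ...
    shard, total = [], 0
    for i, example in enumerate(stream):
        total = i + 1
        if i % world_size == rank:
            shard.append(example)
    # The longest shard has ceil(total / world_size) examples; repeat the last
    # example (at most once) so every shard ends up the same length.
    target = -(-total // world_size)
    if shard and len(shard) < target:
        shard.append(shard[-1])
    return shard

dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for rank in range(3):
    print(shard_and_pad(dataset, rank, world_size=3))
# [1, 4, 7, 10]
# [2, 5, 8, 8]
# [3, 6, 9, 9]
```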

The benefits of this sharding + padding approach are:

  1. It doesn't require us to know the total dataset size in advance
  2. The padding guarantees that each shard has the same number of examples, so we don't need to handle mismatched batch counts across nodes
  3. For every shard, the maximum padding is 1 example, which is negligible when the dataset is large

Note that the current hiveio API is not streamed, so the hive reader can still hit OOM even when the sharded dataset would fit in memory.

Differential Revision: D13644994

facebook-github-bot added the CLA Signed label Jan 12, 2019
chenyangyu1988 added a commit to chenyangyu1988/pytext that referenced this pull request Jan 16, 2019
Summary:
Pull Request resolved: facebookresearch#203

```
Main changes
1. read_from_file and hive_reader accept rank and world_size as input parameters and load only the shard of data required by the node
2. we take the shard based on rank + padding, so we don't need to know the dataset size ahead of time

offset = rank * (datasize // world_size) + min(rank, datasize % world_size)
len = datasize // world_size + (1 if rank < datasize % world_size else 0)
```

The intention of this diff is to reduce memory usage when each node loads its shard of the dataset.

The current implementation is that every node loads the whole dataset into memory and then takes its shard, which can cause OOM issues because total memory usage grows as num_gpus * dataset_size.

This diff changes this so that:
1. each node loads only its own shard into memory, so total memory usage is approximately the same for multi-GPU and single-GPU training
2. we take the shard range using offset = rank * (datasize // world_size) + min(rank, datasize % world_size) and shard_len = datasize // world_size + (1 if rank < datasize % world_size else 0), and we may need to pad one extra example in some shards (see the example and sketch below)

Example
dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
world_size = 3

shard_1 = [1, 2, 3, 4]
shard_2 = [5, 6, 7, 7]
shard_3 = [8, 9, 10, 10]
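
A minimal sketch of the offset/length formula above (illustrative only; shard_range and load_shard are hypothetical names, not the PR's actual functions):

```python
def shard_range(datasize, rank, world_size):
    # The first (datasize % world_size) ranks each take one extra example.
    offset = rank * (datasize // world_size) + min(rank, datasize % world_size)
    shard_len = datasize // world_size + (1 if rank < datasize % world_size else 0)
    return offset, shard_len

def load_shard(dataset, rank, world_size):
    offset, shard_len = shard_range(len(dataset), rank, world_size)
    shard = dataset[offset:offset + shard_len]
    # Pad (by repeating the last example at most once) up to
    # ceil(datasize / world_size) so every rank sees the same number of batches.
    target = -(-len(dataset) // world_size)
    if shard and len(shard) < target:
        shard.append(shard[-1])
    return shard

dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for rank in range(3):
    print(load_shard(dataset, rank, world_size=3))
# [1, 2, 3, 4]
# [5, 6, 7, 7]
# [8, 9, 10, 10]
```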

The benefits of this sharding + padding approach are:
1. It doesn't require us to know the total dataset size in advance
2. The padding guarantees that each shard has the same number of examples, so we don't need to handle mismatched batch counts across nodes
3. For every shard, the maximum padding is 1 example, which is negligible when the dataset is large

Note that the current hiveio API is not streamed, so the hive reader can still hit OOM even when the sharded dataset would fit in memory.

Differential Revision: D13644994

fbshipit-source-id: 1d84eefa78c13c9867ef38a378855ffde2295795
chenyangyu1988 added a commit to chenyangyu1988/pytext that referenced this pull request Jan 16, 2019
chenyangyu1988 added a commit to chenyangyu1988/pytext that referenced this pull request Jan 17, 2019
chenyangyu1988 deleted the export-D13644994 branch May 15, 2019 17:41