
[Data Loading] Round-based per-epoch shuffling data loader for distributed training. #15531

Closed
wants to merge 30 commits from the uber-shuffle branch

Conversation

clarkzinzow (Contributor)

This PR adds a round-based per-epoch shuffling data loader for distributed training. I'm opening it early to make collaboration easier.
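
For readers skimming this: the core pattern is a per-epoch map/reduce shuffle. Each epoch, mapper tasks reshuffle and re-partition their shard of the dataset, and reducer tasks assemble the freshly shuffled slice that a single trainer consumes for that epoch. Below is a rough, purely illustrative sketch of that pattern; the names (shuffle_map, shuffle_reduce, num_reducers) and the use of in-memory NumPy arrays are assumptions made for the example, not the API in this PR.

import numpy as np
import ray

ray.init()


@ray.remote
def shuffle_map(shard: np.ndarray, num_reducers: int, seed: int):
    # Shuffle this mapper's shard for the current epoch and split it into
    # one chunk per reducer.
    rng = np.random.default_rng(seed)
    return tuple(np.array_split(rng.permutation(shard), num_reducers))


@ray.remote
def shuffle_reduce(*chunks: np.ndarray) -> np.ndarray:
    # Merge the chunks received from every mapper and shuffle once more;
    # this is the data one trainer consumes for the epoch.
    merged = np.concatenate(chunks)
    np.random.shuffle(merged)
    return merged


def shuffle_epoch(shards, num_reducers: int, seed: int):
    # One shuffle "round": re-run map and reduce so each trainer sees a new
    # permutation of the full dataset every epoch.
    map_out = [
        shuffle_map.options(num_returns=num_reducers)
        .remote(shard, num_reducers, seed + i)
        for i, shard in enumerate(shards)
    ]
    # Reducer j receives chunk j from every mapper.
    return [
        shuffle_reduce.remote(*[map_out[m][j] for m in range(len(shards))])
        for j in range(num_reducers)
    ]


# Example: 4 mapper shards feeding 2 trainers (one reducer per trainer).
shards = np.array_split(np.arange(1000), 4)
epoch_refs = shuffle_epoch(shards, num_reducers=2, seed=0)
print([len(ray.get(ref)) for ref in epoch_refs])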

clarkzinzow added the @author-action-required label (Apr 27, 2021)
clarkzinzow force-pushed the uber-shuffle branch 2 times, most recently from 409ecd0 to 4d6a43f (April 28, 2021 03:29)

richardliaw (Contributor)

I got it to work on 4 GPUs.

Nit: I'm seeing this warning:

2021-04-28 08:04:47,389 WARNING import_thread.py:133 -- The remote function 'ray.experimental.data_loader.shuffle.consume' has been exported 100 times. It's possible that this warning is accidental, but this may indicate that the same remote function is being defined repeatedly from within many tasks and exported to all of the workers. This can be a performance issue and can be resolved by defining the remote function on the driver instead. See #6240 for more discussion.
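
For context, that warning typically means a remote function is being defined inside other tasks, so it gets re-exported to the workers on every call; defining it once at module scope on the driver avoids the repeated exports. A toy illustration of the difference (not code from this PR):

import ray

ray.init()


@ray.remote
def consume(batch):
    # Defined once at module scope on the driver: exported to workers once.
    return len(batch)


@ray.remote
def outer(batch):
    # Anti-pattern: inner_consume is re-defined (and re-exported) every time
    # outer runs, which is the situation the warning is about.
    @ray.remote
    def inner_consume(b):
        return len(b)

    return ray.get(inner_consume.remote(batch))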

richardliaw (Contributor) commented Apr 28, 2021

Example working here: https://gist.github.com/9040abcc654ce6b5ed817b8263d723e2

I also had to add this to get the workload running:

diff --git a/python/ray/experimental/data_loader/multiqueue.py b/python/ray/experimental/data_loader/multiqueue.py
index 9e0cb0394..4b70f1c2e 100644
--- a/python/ray/experimental/data_loader/multiqueue.py
+++ b/python/ray/experimental/data_loader/multiqueue.py
@@ -3,7 +3,9 @@ from typing import Optional, Any, List, Dict
 from collections.abc import Iterable

 import ray
-
+import logging
+import time
+logger = logging.getLogger(__name__)

 class Empty(Exception):
     pass
@@ -49,13 +51,20 @@ class MultiQueue:

     def __init__(self, num_queues: int, maxsize: int = 0,
                  name: str = None, connect: bool = False,
-                 actor_options: Optional[Dict] = None) -> None:
+                 actor_options: Optional[Dict] = None, retries=5) -> None:
         self.num_queues = num_queues
         self.maxsize = maxsize
         if connect:
             assert actor_options is None
             assert name is not None
-            self.actor = ray.get_actor(name)
+            for i in range(retries):
+                try:
+                    self.actor = ray.get_actor(name)
+                    break
+                except ValueError:
+                    logger.info(
+                        f"Did not acquire actor. Trying again [{i}/{retries}].")
+                    time.sleep(3)
         else:
             actor_options = actor_options or {}
             if name is not None:
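
The same retry idea as a standalone helper, sketched here purely as an illustration (the helper name is made up); one open design question is whether to fail loudly once the retries are exhausted instead of leaving self.actor unset:

import logging
import time

import ray

logger = logging.getLogger(__name__)


def get_actor_with_retry(name: str, retries: int = 5, delay_s: float = 3.0):
    # ray.get_actor raises ValueError while the named actor does not exist
    # yet, so poll a few times before giving up.
    for i in range(retries):
        try:
            return ray.get_actor(name)
        except ValueError:
            logger.info("Actor %r not ready yet (attempt %d/%d).",
                        name, i + 1, retries)
            time.sleep(delay_s)
    raise ValueError(f"Actor {name!r} did not appear after {retries} retries.")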

richardliaw (Contributor) commented Apr 28, 2021

Trying this now on a 16 GPU cluster with the provided example. A couple of notes:

  • Sometimes I see a hang. Not sure why.
  • Not sure how to set num_reducers and num_mappers. It'd be great to have good defaults so users don't need to touch them (usually this is scaffolded away so deeply that users can't touch it anyway); one possible heuristic is sketched after this list.
  • You can use this cluster: https://gist.github.com/c8bf8fd7eab92d89ea71c612377ad29d
ray up -y cluster
ray submit cluster ray_torch_shuffle.py
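
On the defaults question in the list above: one possible heuristic (an assumption on my part, not something this PR implements) is to tie num_reducers to the number of trainers and bound num_mappers by the input partitioning and the CPUs available in the cluster:

import ray


def default_shuffle_parallelism(num_trainers: int, num_input_files: int):
    # Assumes ray.init() has already been called.
    # One reducer per trainer keeps reduce outputs aligned with consumers.
    num_reducers = num_trainers
    # Use the remaining CPUs for mappers, but never more mappers than input
    # files, since each mapper needs at least one file to read.
    cluster_cpus = int(ray.cluster_resources().get("CPU", 1))
    num_mappers = max(1, min(num_input_files, cluster_cpus - num_reducers))
    return num_mappers, num_reducers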

stephanie-wang (Contributor) left a comment


I mainly reviewed shuffle.py for now.

(Inline review comments on python/ray/experimental/data_loader/shuffle.py (6) and python/ray/experimental/data_loader/dataset.py (1); all outdated and resolved.)
clarkzinzow force-pushed the uber-shuffle branch 4 times, most recently from b1b03a5 to a8686f2 (May 4, 2021 02:49)
clarkzinzow (Contributor, Author)

Moved to an external repo.

clarkzinzow closed this on May 4, 2021