
[Data Loading] Round-based per-epoch shuffling data loader for distributed training. #15531

Closed
wants to merge 30 commits from the uber-shuffle branch

Conversation

clarkzinzow (Contributor)

This PR adds a round-based per-epoch shuffling data loader for distributed training. I'm opening it early to make collaboration easier.
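
For readers skimming this: the core pattern is a per-epoch map/reduce shuffle. Each epoch, mapper tasks reshuffle and re-partition their shard of the dataset, and reducer tasks assemble the freshly shuffled slice that a single trainer consumes for that epoch. Below is a rough, purely illustrative sketch of that pattern; the names (shuffle_map, shuffle_reduce, num_reducers) and the use of in-memory NumPy arrays are assumptions made for the example, not the API in this PR.

import numpy as np
import ray

ray.init()


@ray.remote
def shuffle_map(shard: np.ndarray, num_reducers: int, seed: int):
    # Shuffle this mapper's shard for the current epoch and split it into
    # one chunk per reducer.
    rng = np.random.default_rng(seed)
    return tuple(np.array_split(rng.permutation(shard), num_reducers))


@ray.remote
def shuffle_reduce(*chunks: np.ndarray) -> np.ndarray:
    # Merge the chunks received from every mapper and shuffle once more;
    # this is the data one trainer consumes for the epoch.
    merged = np.concatenate(chunks)
    np.random.shuffle(merged)
    return merged


def shuffle_epoch(shards, num_reducers: int, seed: int):
    # One shuffle "round": re-run map and reduce so each trainer sees a new
    # permutation of the full dataset every epoch.
    map_out = [
        shuffle_map.options(num_returns=num_reducers)
        .remote(shard, num_reducers, seed + i)
        for i, shard in enumerate(shards)
    ]
    # Reducer j receives chunk j from every mapper.
    return [
        shuffle_reduce.remote(*[map_out[m][j] for m in range(len(shards))])
        for j in range(num_reducers)
    ]


# Example: 4 mapper shards feeding 2 trainers (one reducer per trainer).
shards = np.array_split(np.arange(1000), 4)
epoch_refs = shuffle_epoch(shards, num_reducers=2, seed=0)
print([len(ray.get(ref)) for ref in epoch_refs])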

clarkzinzow added the @author-action-required label (Apr 27, 2021)
clarkzinzow force-pushed the uber-shuffle branch 2 times, most recently from 409ecd0 to 4d6a43f (April 28, 2021 03:29)

richardliaw (Contributor)

I got it to work on 4 GPUs.

Nit: I'm seeing this warning:

2021-04-28 08:04:47,389 WARNING import_thread.py:133 -- The remote function 'ray.experimental.data_loader.shuffle.consume' has been exported 100 times. It's possible that this warning is accidental, but this may indicate that the same remote function is being defined repeatedly from within many tasks and exported to all of the workers. This can be a performance issue and can be resolved by defining the remote function on the driver instead. See #6240 for more discussion.
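
For context, that warning typically means a remote function is being defined inside other tasks, so it gets re-exported to the workers on every call; defining it once at module scope on the driver avoids the repeated exports. A toy illustration of the difference (not code from this PR):

import ray

ray.init()


@ray.remote
def consume(batch):
    # Defined once at module scope on the driver: exported to workers once.
    return len(batch)


@ray.remote
def outer(batch):
    # Anti-pattern: inner_consume is re-defined (and re-exported) every time
    # outer runs, which is the situation the warning is about.
    @ray.remote
    def inner_consume(b):
        return len(b)

    return ray.get(inner_consume.remote(batch))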

richardliaw (Contributor) commented Apr 28, 2021

Example working here: https://gist.github.com/9040abcc654ce6b5ed817b8263d723e2

I also had to add this to get the workload running:

diff --git a/python/ray/experimental/data_loader/multiqueue.py b/python/ray/experimental/data_loader/multiqueue.py
index 9e0cb0394..4b70f1c2e 100644
--- a/python/ray/experimental/data_loader/multiqueue.py
+++ b/python/ray/experimental/data_loader/multiqueue.py
@@ -3,7 +3,9 @@ from typing import Optional, Any, List, Dict
 from collections.abc import Iterable

 import ray
-
+import logging
+import time
+logger = logging.getLogger(__name__)

 class Empty(Exception):
     pass
@@ -49,13 +51,20 @@ class MultiQueue:

     def __init__(self, num_queues: int, maxsize: int = 0,
                  name: str = None, connect: bool = False,
-                 actor_options: Optional[Dict] = None) -> None:
+                 actor_options: Optional[Dict] = None, retries=5) -> None:
         self.num_queues = num_queues
         self.maxsize = maxsize
         if connect:
             assert actor_options is None
             assert name is not None
-            self.actor = ray.get_actor(name)
+            for i in range(retries):
+                try:
+                    self.actor = ray.get_actor(name)
+                    break
+                except ValueError:
+                    logger.info(
+                        f"Did not acquire actor. Trying again [{i}/{retries}].")
+                    time.sleep(3)
         else:
             actor_options = actor_options or {}
             if name is not None:
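
The same retry idea as a standalone helper, sketched here purely as an illustration (the helper name is made up); one open design question is whether to fail loudly once the retries are exhausted instead of leaving self.actor unset:

import logging
import time

import ray

logger = logging.getLogger(__name__)


def get_actor_with_retry(name: str, retries: int = 5, delay_s: float = 3.0):
    # ray.get_actor raises ValueError while the named actor does not exist
    # yet, so poll a few times before giving up.
    for i in range(retries):
        try:
            return ray.get_actor(name)
        except ValueError:
            logger.info("Actor %r not ready yet (attempt %d/%d).",
                        name, i + 1, retries)
            time.sleep(delay_s)
    raise ValueError(f"Actor {name!r} did not appear after {retries} retries.")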

richardliaw (Contributor) commented Apr 28, 2021

Trying this now on a 16 GPU cluster with the provided example. A couple of notes:

  • Sometimes I see a hang. Not sure why.
  • Not sure how to set num_reducers and num_mappers. It'd be great to have good defaults so users don't need to touch them (usually this is scaffolded away so deeply that users can't touch it anyway); one possible heuristic is sketched after this list.
  • You can use this cluster: https://gist.github.com/c8bf8fd7eab92d89ea71c612377ad29d
ray up -y cluster
ray submit cluster ray_torch_shuffle.py
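
On the defaults question in the list above: one possible heuristic (an assumption on my part, not something this PR implements) is to tie num_reducers to the number of trainers and bound num_mappers by the input partitioning and the CPUs available in the cluster:

import ray


def default_shuffle_parallelism(num_trainers: int, num_input_files: int):
    # Assumes ray.init() has already been called.
    # One reducer per trainer keeps reduce outputs aligned with consumers.
    num_reducers = num_trainers
    # Use the remaining CPUs for mappers, but never more mappers than input
    # files, since each mapper needs at least one file to read.
    cluster_cpus = int(ray.cluster_resources().get("CPU", 1))
    num_mappers = max(1, min(num_input_files, cluster_cpus - num_reducers))
    return num_mappers, num_reducers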

stephanie-wang (Contributor) left a comment


I mainly reviewed shuffle.py for now.

(Inline review comments on python/ray/experimental/data_loader/shuffle.py (6) and python/ray/experimental/data_loader/dataset.py (1); all outdated and resolved.)
clarkzinzow force-pushed the uber-shuffle branch 4 times, most recently from b1b03a5 to a8686f2 (May 4, 2021 02:49)
clarkzinzow (Contributor, Author)

Moved to an external repo.

clarkzinzow closed this on May 4, 2021