[tune] Fix trial result fetching #4219

Merged: 4 commits from hartikainen:fix-trial-result-wait into ray-project:master on Mar 4, 2019

Conversation

@hartikainen (Contributor) commented on Mar 2, 2019:

What do these changes do?

Fixes trial result fetching in RayTrialExecutor by shuffling the result object IDs before passing them to ray.wait().

Related issue number

#4211
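
For context on the change, here is a minimal self-contained sketch of the idea, assuming an illustrative helper name rather than the actual RayTrialExecutor method:

import random

import ray

def get_next_available_result(running_result_ids):
    # Shuffle a copy so that ray.wait() does not keep favoring whichever
    # object IDs happen to appear first in the list.
    shuffled = list(running_result_ids)
    random.shuffle(shuffled)
    # Block until at least one of the shuffled results is ready.
    [result_id], _ = ray.wait(shuffled)
    return result_id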

# We shuffle the results because ray.wait by default returns
# the first available result, and we want to guarantee that slower
# trials (i.e. trials that run remotely) also get fairly reported.
# See https://github.com/ray-project/ray/issues/4211 for details.
[result_id], _ = ray.wait(shuffled_results)
@hartikainen (Contributor, Author) commented on this line:
@ericl I don't think this needs a timeout. If I understand correctly, it's not the timeout for ray.wait() that matters in #1128 but the time.sleep() on the worker side. It seems like a timeout wouldn't really make a difference here. Correct me if I'm wrong.
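
As a small standalone illustration of this point (not part of the PR): ray.wait() without a timeout simply blocks until the first task finishes, so a worker-side time.sleep() only delays when that happens rather than changing which result is returned.

import time

import ray

ray.init()

@ray.remote
def trial_step(delay):
    # Simulates a slower (e.g. remote) trial via a worker-side sleep.
    time.sleep(delay)
    return delay

futures = [trial_step.remote(d) for d in (2.0, 0.5, 1.0)]
# No timeout: blocks until the fastest task (0.5 s) completes.
ready, not_ready = ray.wait(futures)
print(ray.get(ready[0]))  # 0.5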

@ericl (Contributor) commented on Mar 2, 2019:

You're probably right. The timeout was only relevant due to the backend bug. Since that is now fixed, the cause of the wait unfairness is likely just the ordering.

@robertnishihara I think one potential fix for this gotcha would be to return results in priority order by completion time, instead of order of ids passed in. I can see a lot of users running into this by accident.
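
A hypothetical illustration of the gotcha described above, assuming ray.wait() prefers the order of the IDs passed in when several results are already complete:

import ray

ray.init()

@ray.remote
def trial_result(i):
    return i

# All three results complete almost immediately.
ids = [trial_result.remote(i) for i in range(3)]
ray.get(ids)  # make sure everything has finished before waiting

# With a fixed ordering, the first-listed ID keeps being returned,
# which is why tune shuffles the list before each wait.
for _ in range(3):
    [ready_id], _ = ray.wait(ids)
    print(ready_id == ids[0])  # expected: True each time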

@@ -216,7 +217,13 @@ def get_running_trials(self):
         return list(self._running.values())
 
     def get_next_available_trial(self):
-        [result_id], _ = ray.wait(list(self._running))
+        shuffled_results = random.sample(
+            self._running.keys(), len(self._running))
A reviewer (Contributor) commented on this change:
nit: use random.shuffle() for clarity
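
A sketch of what that suggestion might look like; the method body and the self._running lookup are assumptions for illustration, not the merged code:

import random

import ray

def get_next_available_trial(self):
    # self._running is assumed to map in-flight result object IDs to trials.
    shuffled_results = list(self._running)
    # In-place shuffle instead of a full-length random.sample().
    random.shuffle(shuffled_results)
    [result_id], _ = ray.wait(shuffled_results)
    return self._running[result_id]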

ericl self-assigned this on Mar 2, 2019
@hartikainen (Contributor, Author) commented on Mar 2, 2019:

This seems to solve my problem. Here's what the worker CPU usage looks like with these changes:
[screenshot: per-worker CPU usage across the cluster]

Unrelated to this issue, but the autoscaler startup seems extremely slow. It took more than 20 minutes to fully start up even though I defined initial_workers=50 (same as max_workers) in the config.

@AmplabJenkins commented:

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12461/

hartikainen force-pushed the fix-trial-result-wait branch from facdf64 to f15ee40 on March 3, 2019 00:00
hartikainen force-pushed the fix-trial-result-wait branch from f15ee40 to a425faa on March 3, 2019 00:06
@AmplabJenkins commented:

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12480/

@AmplabJenkins commented:

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12479/

hartikainen changed the title from "[WIP] Fixes trial result fetching" to "Fix trial result fetching" on Mar 3, 2019
hartikainen changed the title from "Fix trial result fetching" to "[tune] Fix trial result fetching" on Mar 3, 2019
richardliaw merged commit df9beb7 into ray-project:master on Mar 4, 2019