[tune] Fix trial result fetching #4219

Merged
merged 4 commits into from
Mar 4, 2019
9 changes: 8 additions & 1 deletion python/ray/tune/ray_trial_executor.py
@@ -5,6 +5,7 @@

 import logging
 import os
+import random
 import time
 import traceback

@@ -216,7 +217,13 @@ def get_running_trials(self):
         return list(self._running.values())

     def get_next_available_trial(self):
-        [result_id], _ = ray.wait(list(self._running))
+        shuffled_results = list(self._running.keys())
+        random.shuffle(shuffled_results)
+        # Note: We shuffle the results because `ray.wait` by default returns
+        # the first available result, and we want to guarantee that slower
+        # trials (i.e. trials that run remotely) also get fairly reported.
+        # See https://github.com/ray-project/ray/issues/4211 for details.
+        [result_id], _ = ray.wait(shuffled_results)
Contributor Author

@ericl I don't think this needs a timeout. If I understand correctly, what matters in #1128 is not the timeout passed to ray.wait() but the time.sleep() on the worker side, so a timeout wouldn't really make a difference here. Correct me if I'm wrong.

Contributor

@ericl, Mar 2, 2019

You're probably right. The timeout was only relevant due to the backend bug. Since that is now fixed, the cause of the wait unfairness is likely just the ordering.

@robertnishihara I think one potential fix for this gotcha would be to return results in priority order by completion time, instead of in the order of the IDs passed in. I can see a lot of users running into this by accident.
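
For illustration, the ordering effect discussed in this thread can be reproduced without Ray. The sketch below is a simplified model, not Ray's implementation: `first_ready` is a hypothetical stand-in for a wait call that scans the IDs in the order passed in and returns the first one that is ready, and `count_picks` and the trial names are made up for the example. It assumes both trials always have a result pending, which is the worst case for the trial listed second.

```python
import random
from collections import Counter


def first_ready(ids, ready):
    """Hypothetical stand-in for a wait call: scan `ids` in the order
    given and return the first one that is ready."""
    for candidate in ids:
        if candidate in ready:
            return candidate
    raise ValueError("no id is ready")


def count_picks(ids, ready, rounds, shuffle, seed=0):
    """Count how often each id is returned over `rounds` calls."""
    rng = random.Random(seed)
    picks = Counter()
    for _ in range(rounds):
        candidates = list(ids)
        if shuffle:
            # Same idea as the change in this PR: randomize the order
            # before waiting so no id is permanently first in line.
            rng.shuffle(candidates)
        picks[first_ready(candidates, ready)] += 1
    return picks


ids = ["local_trial", "remote_trial"]
ready = {"local_trial", "remote_trial"}  # both always have a result pending

unshuffled = count_picks(ids, ready, rounds=1000, shuffle=False)
shuffled = count_picks(ids, ready, rounds=1000, shuffle=True)
print("unshuffled:", dict(unshuffled))
print("shuffled:  ", dict(shuffled))
```

In this model the unshuffled run never reports `remote_trial`, while the shuffled run reports both trials roughly equally often. Shuffling makes the selection fair on average; it does not order results by completion time, which is the alternative suggested above.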

         return self._running[result_id]

     def fetch_result(self, trial):