[tune] Fix trial result fetching #4219

Merged: 4 commits from hartikainen:fix-trial-result-wait into ray-project:master on Mar 4, 2019

Conversation

@hartikainen (Contributor) commented on Mar 2, 2019:

What do these changes do?

Fixes trial result fetching in RayTrialExecutor by shuffling the result object IDs before passing them to ray.wait().

Related issue number

#4211
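
For context on the change, here is a minimal self-contained sketch of the idea, assuming an illustrative helper name rather than the actual RayTrialExecutor method:

import random

import ray

def get_next_available_result(running_result_ids):
    # Shuffle a copy so that ray.wait() does not keep favoring whichever
    # object IDs happen to appear first in the list.
    shuffled = list(running_result_ids)
    random.shuffle(shuffled)
    # Block until at least one of the shuffled results is ready.
    [result_id], _ = ray.wait(shuffled)
    return result_id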

# We shuffle the results because ray.wait by default returns
# the first available result, and we want to guarantee that slower
# trials (i.e. trials that run remotely) also get fairly reported.
# See https://github.com/ray-project/ray/issues/4211 for details.
[result_id], _ = ray.wait(shuffled_results)
@hartikainen (Contributor, Author) commented on this line:
@ericl I don't think this needs a timeout. If I understand correctly, it's not the timeout for ray.wait() that matters in #1128 but the time.sleep() on the worker side. It seems like a timeout wouldn't really make a difference here. Correct me if I'm wrong.
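
As a small standalone illustration of this point (not part of the PR): ray.wait() without a timeout simply blocks until the first task finishes, so a worker-side time.sleep() only delays when that happens rather than changing which result is returned.

import time

import ray

ray.init()

@ray.remote
def trial_step(delay):
    # Simulates a slower (e.g. remote) trial via a worker-side sleep.
    time.sleep(delay)
    return delay

futures = [trial_step.remote(d) for d in (2.0, 0.5, 1.0)]
# No timeout: blocks until the fastest task (0.5 s) completes.
ready, not_ready = ray.wait(futures)
print(ray.get(ready[0]))  # 0.5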

@ericl (Contributor) commented on Mar 2, 2019:

You're probably right. The timeout was only relevant due to the backend bug. Since that is now fixed, the cause of the wait unfairness is likely just the ordering.

@robertnishihara I think one potential fix for this gotcha would be to return results in priority order by completion time, instead of order of ids passed in. I can see a lot of users running into this by accident.
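
A hypothetical illustration of the gotcha described above, assuming ray.wait() prefers the order of the IDs passed in when several results are already complete:

import ray

ray.init()

@ray.remote
def trial_result(i):
    return i

# All three results complete almost immediately.
ids = [trial_result.remote(i) for i in range(3)]
ray.get(ids)  # make sure everything has finished before waiting

# With a fixed ordering, the first-listed ID keeps being returned,
# which is why tune shuffles the list before each wait.
for _ in range(3):
    [ready_id], _ = ray.wait(ids)
    print(ready_id == ids[0])  # expected: True each time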

@@ -216,7 +217,13 @@ def get_running_trials(self):
         return list(self._running.values())
 
     def get_next_available_trial(self):
-        [result_id], _ = ray.wait(list(self._running))
+        shuffled_results = random.sample(
+            self._running.keys(), len(self._running))
A reviewer (Contributor) commented on this change:
nit: use random.shuffle() for clarity
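
A sketch of what that suggestion might look like; the method body and the self._running lookup are assumptions for illustration, not the merged code:

import random

import ray

def get_next_available_trial(self):
    # self._running is assumed to map in-flight result object IDs to trials.
    shuffled_results = list(self._running)
    # In-place shuffle instead of a full-length random.sample().
    random.shuffle(shuffled_results)
    [result_id], _ = ray.wait(shuffled_results)
    return self._running[result_id]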

ericl self-assigned this on Mar 2, 2019
@hartikainen (Contributor, Author) commented on Mar 2, 2019:

This seems to solve my problem. Here's what the worker CPU usage looks like with these changes:
[screenshot: per-worker CPU usage across the cluster]

Unrelated to this issue, but the autoscaler startup seems extremely slow. It took more than 20 minutes to fully start up even though I defined initial_workers=50 (same as max_workers) in the config.

@AmplabJenkins commented:

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12461/

hartikainen force-pushed the fix-trial-result-wait branch from facdf64 to f15ee40 on March 3, 2019 00:00
hartikainen force-pushed the fix-trial-result-wait branch from f15ee40 to a425faa on March 3, 2019 00:06
@AmplabJenkins commented:

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12480/

@AmplabJenkins commented:

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12479/

hartikainen changed the title from "[WIP] Fixes trial result fetching" to "Fix trial result fetching" on Mar 3, 2019
hartikainen changed the title from "Fix trial result fetching" to "[tune] Fix trial result fetching" on Mar 3, 2019
richardliaw merged commit df9beb7 into ray-project:master on Mar 4, 2019