Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[webui] Scalability fixes for the task timeline and visualizations #935

Merged
merged 7 commits into from
Sep 10, 2017

Conversation

ericl
Copy link
Contributor

@ericl ericl commented Sep 6, 2017

This PR ensures web ui responsiveness (#933) by adding a hard cap on the number of tasks returned (10000).

You can test with the following snippet which launches one million tasks locally. The UI should respond fluidly in all the visualizations except the timeline, which still should update within about 10 seconds.

import time
import ray

@ray.remote
def f(x):
    return x

print("Launching tasks")

ray.init()
for x in range(1000000):
    y = f.remote(x)
ray.get(y)
print("Done")

time.sleep(99999999)

cc @alanamarzoev @robertnishihara

@@ -562,8 +571,10 @@ def micros(ts):
def micros_rel(ts):
return micros(ts - start_time)

task_profiles = self.task_profiles(start=0, end=time.time())
Copy link
Contributor Author

@ericl ericl Sep 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was performing an unnecessary full table scan.

@@ -562,8 +571,10 @@ def micros(ts):
def micros_rel(ts):
return micros(ts - start_time)

task_profiles = self.task_profiles(start=0, end=time.time())
task_table = self.task_table()
Copy link
Contributor Author

@ericl ericl Sep 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was also a full table scan. The replacement is slower when the number of tasks is small, but has a bounded worst case latency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is actually a way to get the best of both worlds with the SCAN [1] class of functions from redis (I guess that's what you mean with the TODO above, do you want to make it a little more precise?)

[1] https://redis.io/commands/scan

Copy link
Contributor Author

@ericl ericl Sep 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SCAN is still O(n), but MGET is bounded by the limit. I think the refactoring is a bit nontrivial but I updated the TODO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, sounds good!

["ParentTaskID"]]
["get_arguments_start"]),
"ts": micros_rel(
parent_profile and
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With truncation it's possible we don't have some parent profiles. We are just ignoring them for now.

@@ -419,17 +422,25 @@ def task_profiles(self, start=None, end=None, num_tasks=None, fwd=True):
task_info = dict()
event_log_sets = self.redis_client.keys("event_log*")

if num_tasks is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code should not be part of the GlobalState API and rather num_tasks should be passed in from the webui

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/1782/
Test PASSed.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/1783/
Test PASSed.

@ericl
Copy link
Contributor Author

ericl commented Sep 8, 2017

Looks like these are broken tests, I'll take a look tomorrow.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/1796/
Test PASSed.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/1800/
Test PASSed.

@ericl
Copy link
Contributor Author

ericl commented Sep 10, 2017

@pcmoritz looks like tests are passing. One of the travis builds was cancelled but I don't see failures.

@robertnishihara
Copy link
Collaborator

Nice! I'm trying it out now.

@robertnishihara
Copy link
Collaborator

I think I can generate key errors when I use timelines with task dependencies (the line numbers are sligtly off because I rebased the PR locally).

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/Workspace/ray/python/ray/experimental/ui.py in handle_submit(sender)
    442                                              breakdowns=breakdown,
    443                                              obj_dep=obj_dep.value,
--> 444                                              task_dep=task_dep.value)
    445         print("Opening html file in browser...")
    446 

~/Workspace/ray/python/ray/experimental/state.py in dump_catapult_trace(self, path, task_info, breakdowns, task_dep, obj_dep)
    748                             owner_task = self._object_table(arg)["TaskID"]
    749                             owner_worker = (workers[
--> 750                                 task_info[owner_task]["worker_id"]])
    751                             # Adding/subtracting 2 to the time associated with
    752                             # the beginning/ending of the flow event is

KeyError: 'be338b34b87c7497d6fd82768b5f3e3d5f4442c3'

E.g.,

import ray
ray.init()

@ray.remote
def f(x):
    return 1

x = 1
for _ in range(10 ** 5):
    x = f.remote(x)

Then visualize the last 10000 tasks.

Presumably this arises from omitting the initial scan of the entire task table.

@ericl
Copy link
Contributor Author

ericl commented Sep 10, 2017

Alright that should be fixed.

@robertnishihara
Copy link
Collaborator

I don't see any new commits, did you push something?

@ericl
Copy link
Contributor Author

ericl commented Sep 10, 2017 via email

@robertnishihara
Copy link
Collaborator

Looks good to me, I'll merge it once the tests run.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/1813/
Test PASSed.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/1814/
Test PASSed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants