[xray] Implement timeline and profiling API. #2306
Conversation
Test PASSed.
Test FAILed.
Test PASSed.
This PR decreases performance for `%time l = ray.get([f.remote() for _ in range(10000)])`: the time goes from about 1.3s to about 1.6s. I think a natural solution to this is to do more batching (in particular, to flush from workers on a timer instead of after every task). However, that runs into some of the thread-safety issues brought up in #2248, so I may put that on hold for now.
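The timer-based flush idea above can be sketched with a plain background thread. All names here are illustrative, not code from this PR; the lock is the point, since a worker thread appends while the timer thread drains (the thread-safety concern from #2248):

```python
import threading

class BatchedFlusher:
    """Buffer events and flush them periodically instead of per-event.

    Illustrative sketch, not the PR's implementation. A lock protects the
    buffer because the worker thread appends while the background timer
    thread drains it.
    """

    def __init__(self, flush_fn, interval_s=1.0):
        self._flush_fn = flush_fn
        self._interval_s = interval_s
        self._buffer = []
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def add(self, event):
        with self._lock:
            self._buffer.append(event)

    def _run(self):
        # Event.wait returns False on timeout, True once close() sets it.
        while not self._stop.wait(self._interval_s):
            self.flush()

    def flush(self):
        # Drain and deliver under the lock so batches stay in order.
        with self._lock:
            batch, self._buffer = self._buffer, []
            if batch:
                self._flush_fn(batch)  # one call per batch, not per event

    def close(self):
        self._stop.set()
        self._thread.join()
        self.flush()  # drain whatever is left

flushed = []
flusher = BatchedFlusher(flushed.append, interval_s=0.05)
for i in range(100):
    flusher.add(i)
flusher.close()
```

The tradeoff is exactly the one described above: events can sit in the buffer for up to one interval, but the GCS sees one write per batch rather than one per task.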
Test FAILed.
jenkins, retest this please |
python/ray/experimental/state.py
```python
def _profile_table(self, component_id):
    """Get the profile events for a given component.

    TODO(rkn): This method should support limiting the number of log events
```
TODOs should not be part of the doc string (user documentation)
python/ray/experimental/state.py
```python
    web browser and load the dumped file. Make sure to enable "Flow events"
    in the "View Options" menu.

    TODO(rkn): Support including the task specification data in the
```
TODOs should not be part of the docstring
Test PASSed.
```cpp
/// Store some profile events in the GCS.
///
/// \param conn The connection information.
/// \param event_type The type of the event.
```
This needs to be updated
src/ray/raylet/CMakeLists.txt
```cmake
  DEPENDS ${FBS_DEPENDS}
  COMMENT "Running flatc compiler on ${NODE_MANAGER_FBS_SRC}"
  VERBATIM)

#add_custom_target(gen_node_manager_fbs)
```
remove this
Test PASSed.
hi @robertnishihara, it looks like that … For example, if … (see lines 2733 to 2737 in 4ef9d15).

Is my understanding correct? If so, I can file a PR to fix this. Another quick question: is there any reason why only the raylet uses a background thread?
Just found that the background thread is only used for drivers. Why do workers not use it?
@raulchen you're right, I think it should be protected by a lock. It's only used in the raylet because I don't think we should add new functionality or change the behavior for the legacy code path. It's currently only used on the driver, but I think it actually makes sense to use it in workers as well. Basically, each worker/driver could just have a background thread that calls … (line 1006 in 4ef9d15).
```python
# Note(rkn): This is run on a background thread in the driver. It uses the
# local scheduler client. This should be ok because it doesn't read from
# the local scheduler client and we have the GIL here. However, if either
# of those things changes, then we could run into issues.
```
After doing some research in depth, I believe this is still problematic. The reason is that the `get_task` function isn't protected by the GIL (see `Py_BEGIN_ALLOW_THREADS`).

Here's a more detailed explanation of how things can go wrong:

- the main thread executes the `get_task` function and switches out when it's in the middle of `write_message` (the request is half-sent now)
- then the flush-profile-data thread also calls into `write_message` via `push_profile_events`, and also writes data to the connection
- then the server will receive a corrupted request
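The fix being discussed (protecting the connection with a lock) can be sketched independently of Ray: hold one lock across the entire length-prefixed write, so a concurrent writer can never interleave its bytes mid-message. The names (`write_message`, the in-memory stream) are illustrative stand-ins, not the PR's C code:

```python
import io
import struct
import threading

_write_lock = threading.Lock()

def write_message(stream, payload: bytes):
    """Length-prefixed write. Holding the lock across both writes makes
    the message atomic: without it, a thread switch between the header
    write and the payload write interleaves bytes from two messages,
    which is exactly the corrupted-request scenario described above."""
    with _write_lock:
        stream.write(struct.pack("<Q", len(payload)))  # 8-byte length header
        stream.write(payload)

def read_messages(stream):
    """Parse the stream back into messages; fails if framing was corrupted."""
    stream.seek(0)
    messages = []
    while True:
        header = stream.read(8)
        if not header:
            return messages
        (length,) = struct.unpack("<Q", header)
        messages.append(stream.read(length))

# Eight threads write concurrently to one shared stream.
stream = io.BytesIO()
threads = [
    threading.Thread(target=write_message, args=(stream, bytes([i]) * 100))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock, every message survives intact regardless of thread scheduling; removing the lock reintroduces the half-sent-request race.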
This PR adds more detailed profiling information to the timeline for xray and allows you to view the timeline visualization for xray. It also improves the profiling API (for xray).
Example usage: to get a JSON dump of the profiling information that can be loaded in chrome://tracing, simply do
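The command itself was a code block in the original page and is not preserved in this excerpt; based on the experimental state API this PR touches, it is presumably a one-liner such as `ray.global_state.chrome_tracing_dump(filename=...)` (an assumption, not confirmed here). What chrome://tracing actually consumes is a JSON list of trace events, so a minimal sketch of that format:

```python
# ASSUMPTION: the exact Ray call is not shown in this excerpt; it is
# presumably something like
#
#     import ray
#     ray.init()
#     ray.global_state.chrome_tracing_dump(filename="/tmp/timeline.json")
#
# The dumped file is a JSON list of Chrome trace events. A minimal
# "complete" event (ph="X") looks like this:
import json

def make_trace_event(name, start_us, duration_us, pid=1, tid=1):
    """Build one Chrome trace 'complete' event (ph='X')."""
    return {
        "cat": "task",       # category shown in the tracing UI
        "name": name,        # label displayed on the timeline bar
        "ph": "X",           # 'X' = complete event (has a duration)
        "pid": pid,          # row grouping, e.g. one "process" per worker
        "tid": tid,          # sub-row within the process
        "ts": start_us,      # start time in microseconds
        "dur": duration_us,  # duration in microseconds
    }

events = [make_trace_event("f", 0, 1500), make_trace_event("f", 1600, 1400)]
dump = json.dumps(events)  # this string is what chrome://tracing loads
```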
The dump will include profiling information for tasks (including the prior breakdown into deserializing arguments, executing the function, and getting the results) as well as `ray.put`, `ray.get`, `ray.wait`, task submission, and various other little things. Profiling information will also be included from the driver, and this PR allows for the possibility of creating profiling events from any process that has a GCS client.

To add a custom span to the timeline, simply do
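The custom-span call was a code block in the original page and is not preserved here; it is presumably a context manager along the lines of `with ray.profile("custom event"): ...` (an assumption, not confirmed by this excerpt). A context manager is a natural fit because it records a start time on entry and an end time on exit; a minimal stand-in illustrating that shape:

```python
# ASSUMPTION: the real API in this PR is not shown in this excerpt; this
# is a stand-in demonstrating the start/end-timestamp pattern a profiling
# context manager would use.
import time
from contextlib import contextmanager

recorded_spans = []  # stand-in for events pushed to the GCS

@contextmanager
def profile_span(name):
    """Record a (name, start, end) span around the enclosed block."""
    start = time.monotonic()
    try:
        yield
    finally:
        # 'finally' ensures the span is closed even if the body raises.
        recorded_spans.append((name, start, time.monotonic()))

with profile_span("custom event"):
    time.sleep(0.01)  # stand-in for real work inside the span
```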
Then this will appear in the timeline. This can be done within a task or on the driver. An example timeline is attached.
Not implemented in this PR
TODO in this PR maybe:

- `ProfileTableData` flatbuffer definition
- `AddProfileEvent` method

cc @alanamarzoev