-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Update Error Message and Anti-Pattern for the Case of Forking New Processes in Worker Processes #50705
Conversation
…in application code Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
@@ -41,6 +41,13 @@ using flatbuf::MessageType; | |||
using flatbuf::ObjectSource; | |||
using flatbuf::PlasmaError; | |||
|
|||
constexpr std::string_view DEBUG_STRING = "debug_string"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inline constexpr
, otherwise UB
@@ -20,6 +20,7 @@ | |||
#include <utility> | |||
|
|||
#include "flatbuffers/flatbuffers.h" | |||
#include "protocol.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bazel uses absolute path
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
On the antipattern documentation: Using multiprocessing is fine (we need to support that), the problem is using multiprocessing with "fork" (which is unfortunately the default, but it seems to be changing with Python 3.14 -- python/cpython#84559). Using multiprocessing with "fork" generates many problems with lots of other third party libraries like pytorch and grpc. Instead we should document how to use multiprocessing with "spawn" which works fine. We cannot expect people to rewrite their existing multiprocessing code with Ray tasks / actors -- that would be unreasonable. Switching from fork to spawn can be done with import multiprocessing
multiprocessing.set_start_method("spawn") |
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Good point! I've updated the anti-pattern to "fork" new processes and suggested to use spawn or leverage Ray itself. |
@@ -41,6 +41,13 @@ using flatbuf::MessageType; | |||
using flatbuf::ObjectSource; | |||
using flatbuf::PlasmaError; | |||
|
|||
inline constexpr std::string_view DEBUG_STRING = "debug_string"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inline constexpr std::string_view DEBUG_STRING = "debug_string"; | |
inline constexpr std::string_view kDebugString = "debug_string"; |
Actually don't sure whether we need to define a constant here, seems it's just used in one place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should also be defined in the protocol.cc file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually don't sure whether we need to define a constant here, seems it's just used in one place.
Although it is only used in one place but the function can potentially be called multiple times, so I think it is still better to put it here like this.
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
"This could be due to " | ||
"process forking in core worker or driver code which results in multiple processes " | ||
"sharing the same Plasma store socket. Please ensure that there are no " | ||
"process forking in any of the application core worker or driver code."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a link to the anti-pattern here so the user knows how to fix this?
@@ -46,6 +46,12 @@ enum class ObjectState : int { | |||
PLASMA_SEALED = 2, | |||
}; | |||
|
|||
constexpr std::string_view kCorruptedRequestErrorMessage = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inline constexpr
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
|
||
@ray.remote | ||
def generate_response(request): | ||
return "Response to " + request |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return big numpy array
@@ -0,0 +1,24 @@ | |||
Anti-pattern: Forking new Processes in Application Code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anti-pattern: Forking new Processes in Application Code | |
Anti-pattern: Forking new processes in tasks or actors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline. "Application Code" can include drivers, tasks, actors that inside the ray context. So will keep the "Application Code" in the subject and clarify it in the doc.
Anti-pattern: Forking new Processes in Application Code | ||
======================================================== | ||
|
||
**Summary:** Don't fork new processes in application code. Instead, use "spawn" method |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
**Summary:** Don't fork new processes in application code. Instead, use "spawn" method | |
**Summary:** Don't fork new processes in tasks or actors. Instead, use "spawn" method |
@@ -27,3 +27,4 @@ This section is a collection of common design patterns and anti-patterns for wri | |||
closure-capture-large-objects | |||
global-variables | |||
out-of-band-object-ref-serialization | |||
create-new-processes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
create-new-processes | |
fork-new-processes |
"sharing the same Plasma store socket. Please ensure that there are no " | ||
"process forking in any of the application core worker or driver code. Follow the " | ||
"link here to learn more about the issue and how to fix it: " | ||
"https://docs.ray.io/en/latest/ray-core/patterns/create-new-processes.html"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"https://docs.ray.io/en/latest/ray-core/patterns/create-new-processes.html"; | |
"https://docs.ray.io/en/latest/ray-core/patterns/fork-new-processes.html"; |
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
@dayshah Can you help to review the PR from the documentation's perspective? Thanks! |
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Why are these changes needed?
Recently, we investigated an object store message corruption issue and eventually found out that it is due to a misuse of the Ray library. The user forked multiple child processes from the driver process which leads to multiple processes sharing the same domain socket to the plasma store without any synchronization and thus the message corruption.
To reduce the time to investigate for similar issue and to help user better understand the anti-pattern, this PR:
(1) Update the error message to indicate the potential failure reason in the case of corrupted message
(2) Added the corresponding anti-pattern in the oss document
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.