Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Update Error Message and Anti-Pattern for the Case of Forking New Processes in Worker Processes #50705

Merged
merged 15 commits into from
Feb 22, 2025

Conversation

MengjinYan
Copy link
Collaborator

Why are these changes needed?

Recently, we investigated an object store message corruption issue and eventually found out that it is due to a misuse of the Ray library. The user forked multiple child processes from the driver process which leads to multiple processes sharing the same domain socket to the plasma store without any synchronization and thus the message corruption.

To reduce the time to investigate for similar issue and to help user better understand the anti-pattern, this PR:
(1) Update the error message to indicate the potential failure reason in the case of corrupted message
(2) Added the corresponding anti-pattern in the oss document

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

…in application code

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
@MengjinYan MengjinYan requested a review from a team as a code owner February 19, 2025 01:38
@@ -41,6 +41,13 @@ using flatbuf::MessageType;
using flatbuf::ObjectSource;
using flatbuf::PlasmaError;

constexpr std::string_view DEBUG_STRING = "debug_string";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inline constexpr, otherwise UB

@@ -20,6 +20,7 @@
#include <utility>

#include "flatbuffers/flatbuffers.h"
#include "protocol.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bazel uses absolute path

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
@pcmoritz
Copy link
Contributor

pcmoritz commented Feb 19, 2025

On the antipattern documentation: Using multiprocessing is fine (we need to support that), the problem is using multiprocessing with "fork" (which is unfortunately the default, but it seems to be changing with Python 3.14 -- python/cpython#84559). Using multiprocessing with "fork" generates many problems with lots of other third party libraries like pytorch and grpc.

Instead we should document how to use multiprocessing with "spawn" which works fine. We cannot expect people to rewrite their existing multiprocessing code with Ray tasks / actors -- that would be unreasonable. Switching from fork to spawn can be done with

import multiprocessing
multiprocessing.set_start_method("spawn")

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
@MengjinYan MengjinYan added the go add ONLY when ready to merge, run all tests label Feb 20, 2025
@MengjinYan MengjinYan changed the title [Core] Update Error Message and Anti-Pattern for the Case of Creating New Processes in Worker Processes [Core] Update Error Message and Anti-Pattern for the Case of Forking New Processes in Worker Processes Feb 20, 2025
@MengjinYan MengjinYan requested review from jjyao and dayshah February 20, 2025 17:35
@MengjinYan
Copy link
Collaborator Author

On the antipattern documentation: Using multiprocessing is fine (we need to support that), the problem is using multiprocessing with "fork" (which is unfortunately the default, but it seems to be changing with Python 3.14 -- python/cpython#84559). Using multiprocessing with "fork" generates many problems with lots of other third party libraries like pytorch and grpc.

Instead we should document how to use multiprocessing with "spawn" which works fine. We cannot expect people to rewrite their existing multiprocessing code with Ray tasks / actors -- that would be unreasonable. Switching from fork to spawn can be done with

import multiprocessing
multiprocessing.set_start_method("spawn")

Good point! I've updated the anti-pattern to "fork" new processes and suggested to use spawn or leverage Ray itself.

@@ -41,6 +41,13 @@ using flatbuf::MessageType;
using flatbuf::ObjectSource;
using flatbuf::PlasmaError;

inline constexpr std::string_view DEBUG_STRING = "debug_string";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
inline constexpr std::string_view DEBUG_STRING = "debug_string";
inline constexpr std::string_view kDebugString = "debug_string";

Actually don't sure whether we need to define a constant here, seems it's just used in one place.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should also be defined in the protocol.cc file

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually don't sure whether we need to define a constant here, seems it's just used in one place.

Although it is only used in one place but the function can potentially be called multiple times, so I think it is still better to put it here like this.

Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
"This could be due to "
"process forking in core worker or driver code which results in multiple processes "
"sharing the same Plasma store socket. Please ensure that there are no "
"process forking in any of the application core worker or driver code.";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a link to the anti-pattern here so the user knows how to fix this?

@@ -46,6 +46,12 @@ enum class ObjectState : int {
PLASMA_SEALED = 2,
};

constexpr std::string_view kCorruptedRequestErrorMessage =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inline constexpr

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>

@ray.remote
def generate_response(request):
return "Response to " + request
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return big numpy array

@@ -0,0 +1,24 @@
Anti-pattern: Forking new Processes in Application Code
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Anti-pattern: Forking new Processes in Application Code
Anti-pattern: Forking new processes in tasks or actors

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline. "Application Code" can include drivers, tasks, actors that inside the ray context. So will keep the "Application Code" in the subject and clarify it in the doc.

Anti-pattern: Forking new Processes in Application Code
========================================================

**Summary:** Don't fork new processes in application code. Instead, use "spawn" method
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
**Summary:** Don't fork new processes in application code. Instead, use "spawn" method
**Summary:** Don't fork new processes in tasks or actors. Instead, use "spawn" method

@@ -27,3 +27,4 @@ This section is a collection of common design patterns and anti-patterns for wri
closure-capture-large-objects
global-variables
out-of-band-object-ref-serialization
create-new-processes
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
create-new-processes
fork-new-processes

"sharing the same Plasma store socket. Please ensure that there are no "
"process forking in any of the application core worker or driver code. Follow the "
"link here to learn more about the issue and how to fix it: "
"https://docs.ray.io/en/latest/ray-core/patterns/create-new-processes.html";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"https://docs.ray.io/en/latest/ray-core/patterns/create-new-processes.html";
"https://docs.ray.io/en/latest/ray-core/patterns/fork-new-processes.html";

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
@jjyao jjyao requested a review from pcmoritz February 21, 2025 18:41
@MengjinYan
Copy link
Collaborator Author

@dayshah Can you help to review the PR from the documentation's perspective? Thanks!

@jjyao jjyao enabled auto-merge (squash) February 21, 2025 22:53
@github-actions github-actions bot disabled auto-merge February 21, 2025 22:53
@jjyao jjyao enabled auto-merge (squash) February 22, 2025 00:14
@jjyao jjyao merged commit 2a85cef into ray-project:master Feb 22, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants