[Core] Update Error Message and Anti-Pattern for the Case of Forking New Processes in Worker Processes #50705

MengjinYan · 2025-02-19T01:38:56Z

Why are these changes needed?

Recently, we investigated an object store message corruption issue and eventually found that it was caused by a misuse of the Ray library. The user forked multiple child processes from the driver process, which led to multiple processes sharing the same domain socket to the plasma store without any synchronization, and thus to the message corruption.
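
The failure mode can be sketched outside Ray. The following POSIX-only illustration is not Ray's plasma client: it assumes `socket.socketpair()` as a stand-in for the connection to the store, and `demo_shared_socket` is a hypothetical name. It shows that a forked child inherits the parent's socket, so bytes from both processes arrive on one stream.

```python
import os
import socket


def demo_shared_socket():
    """Show that a forked child writes to the same socket as its parent (POSIX only)."""
    # Stand-in for a client connection; this is NOT the real plasma store socket.
    writer, reader = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        # Child: the inherited `writer` refers to the very same connection.
        try:
            writer.sendall(b"child;")
        finally:
            os._exit(0)
    # Wait for the child only to keep this demo deterministic.
    os.waitpid(pid, 0)
    writer.sendall(b"parent;")
    writer.close()
    # Read until EOF: both writers' bytes come out of one stream.
    received = b""
    while chunk := reader.recv(1024):
        received += chunk
    reader.close()
    return received


if __name__ == "__main__":
    print(demo_shared_socket())
```

The demo orders the two writes on purpose. Real forked workers have no such ordering, so their requests on a shared socket can interleave and the receiver may see a corrupted message.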

To reduce the time needed to investigate similar issues and to help users better understand the anti-pattern, this PR:
(1) Updates the error message to indicate the potential failure reason in the case of a corrupted message
(2) Adds the corresponding anti-pattern to the OSS documentation

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

dentiny · 2025-02-19T01:49:45Z

src/ray/object_manager/plasma/protocol.h

@@ -41,6 +41,13 @@ using flatbuf::MessageType;
 using flatbuf::ObjectSource;
 using flatbuf::PlasmaError;

+constexpr std::string_view DEBUG_STRING = "debug_string";


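Make this inline constexpr; otherwise each translation unit that includes the header gets its own copy, which can lead to ODR violations (UB).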

dentiny · 2025-02-19T01:49:54Z

src/ray/object_manager/plasma/protocol.cc

@@ -20,6 +20,7 @@
 #include <utility>

 #include "flatbuffers/flatbuffers.h"
+#include "protocol.h"


Bazel uses absolute paths, so include this header by its full path rather than by a relative name.

pcmoritz · 2025-02-19T02:20:49Z

On the antipattern documentation: Using multiprocessing is fine (we need to support that), the problem is using multiprocessing with "fork" (which is unfortunately the default, but it seems to be changing with Python 3.14 -- python/cpython#84559). Using multiprocessing with "fork" generates many problems with lots of other third party libraries like pytorch and grpc.

Instead we should document how to use multiprocessing with "spawn" which works fine. We cannot expect people to rewrite their existing multiprocessing code with Ray tasks / actors -- that would be unreasonable. Switching from fork to spawn can be done with

import multiprocessing
multiprocessing.set_start_method("spawn")
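
Where changing the process-wide default is undesirable, a scoped variant of the same idea is `multiprocessing.get_context("spawn")`. The following is a minimal sketch, not taken from this PR: the helper names `spawn_context` and `parallel_factorials` are hypothetical, and `math.factorial` stands in for real work.

```python
import math
import multiprocessing


def spawn_context():
    """Return a context whose children start via "spawn" instead of "fork"."""
    return multiprocessing.get_context("spawn")


def parallel_factorials(values, workers=2):
    """Compute factorials in spawned child processes (hypothetical helper)."""
    # Spawned children start a fresh interpreter, so they do not inherit
    # the parent's open sockets the way forked children do.
    with spawn_context().Pool(processes=workers) as pool:
        return pool.map(math.factorial, values)


if __name__ == "__main__":
    # The guard matters with "spawn": children re-import the main module,
    # and unguarded code would try to start processes again.
    print(parallel_factorials([3, 4, 5]))
```

Using a context object leaves the global start method untouched, which may be easier to adopt in libraries that cannot call `set_start_method` themselves.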

MengjinYan · 2025-02-20T17:37:38Z

(In reply to pcmoritz's comment above on using multiprocessing with "spawn" instead of "fork".)

Good point! I've updated the anti-pattern so that it targets "fork"-ing new processes specifically, and it now suggests using "spawn" or leveraging Ray itself.

jjyao · 2025-02-20T21:59:35Z

src/ray/object_manager/plasma/protocol.h

@@ -41,6 +41,13 @@ using flatbuf::MessageType;
 using flatbuf::ObjectSource;
 using flatbuf::PlasmaError;

+inline constexpr std::string_view DEBUG_STRING = "debug_string";


Suggested change

-inline constexpr std::string_view DEBUG_STRING = "debug_string";
+inline constexpr std::string_view kDebugString = "debug_string";

Actually, I'm not sure whether we need to define a constant here; it seems to be used in only one place.

It should also be defined in the protocol.cc file.

In reply to the question above: although the constant is used in only one place, the function can potentially be called multiple times, so I think it is still better to define it here like this.

pcmoritz · 2025-02-20T22:35:59Z

src/ray/object_manager/plasma/common.h

+    "This could be due to "
+    "process forking in core worker or driver code which results in multiple processes "
+    "sharing the same Plasma store socket. Please ensure that there are no "
+    "process forking in any of the application core worker or driver code.";


Can you add a link to the anti-pattern here so the user knows how to fix this?

dentiny · 2025-02-21T00:48:19Z

src/ray/object_manager/plasma/common.h

@@ -46,6 +46,12 @@ enum class ObjectState : int {
  PLASMA_SEALED = 2,
 };

+constexpr std::string_view kCorruptedRequestErrorMessage =


inline constexpr

jjyao · 2025-02-21T17:48:51Z

doc/source/ray-core/doc_code/anti_pattern_create_new_processes.py

+
+@ray.remote
+def generate_response(request):
+    return "Response to " + request


Return a big numpy array here instead.

jjyao · 2025-02-21T17:50:38Z

doc/source/ray-core/patterns/create-new-processes.rst

@@ -0,0 +1,24 @@
+Anti-pattern: Forking new Processes in Application Code


Suggested change

-Anti-pattern: Forking new Processes in Application Code
+Anti-pattern: Forking new processes in tasks or actors

Discussed offline. "Application Code" can include drivers, tasks, and actors that run inside the Ray context, so we will keep "Application Code" in the title and clarify it in the doc.

jjyao · 2025-02-21T17:50:56Z

doc/source/ray-core/patterns/create-new-processes.rst

+Anti-pattern: Forking new Processes in Application Code
+========================================================
+
+**Summary:** Don't fork new processes in application code. Instead, use "spawn" method 


Suggested change

-**Summary:** Don't fork new processes in application code. Instead, use "spawn" method
+**Summary:** Don't fork new processes in tasks or actors. Instead, use "spawn" method

jjyao · 2025-02-21T17:51:42Z

doc/source/ray-core/patterns/index.rst

@@ -27,3 +27,4 @@ This section is a collection of common design patterns and anti-patterns for wri
    closure-capture-large-objects
    global-variables
    out-of-band-object-ref-serialization
+    create-new-processes


Suggested change

-create-new-processes
+fork-new-processes

jjyao · 2025-02-21T17:51:58Z

src/ray/object_manager/plasma/common.h

+    "sharing the same Plasma store socket. Please ensure that there are no "
+    "process forking in any of the application core worker or driver code. Follow the "
+    "link here to learn more about the issue and how to fix it: "
+    "https://docs.ray.io/en/latest/ray-core/patterns/create-new-processes.html";


Suggested change

-"https://docs.ray.io/en/latest/ray-core/patterns/create-new-processes.html";
+"https://docs.ray.io/en/latest/ray-core/patterns/fork-new-processes.html";

MengjinYan · 2025-02-21T22:44:13Z

@dayshah Can you help to review the PR from the documentation's perspective? Thanks!

update error message and add anti-pattern for creating new processed …

d9db3be

…in application code Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>

MengjinYan requested a review from a team as a code owner February 19, 2025 01:38

dentiny reviewed Feb 19, 2025

fix pre-commit failure

34e33b8

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>

fix review comment

741f6b4

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>

MengjinYan added the "go" label (add ONLY when ready to merge, run all tests) Feb 20, 2025

MengjinYan changed the title ~~[Core] Update Error Message and Anti-Pattern for the Case of Creating New Processes in Worker Processes~~ [Core] Update Error Message and Anti-Pattern for the Case of Forking New Processes in Worker Processes Feb 20, 2025

MengjinYan requested review from jjyao and dayshah February 20, 2025 17:35

MengjinYan assigned jjyao and dayshah Feb 20, 2025

jjyao reviewed Feb 20, 2025

Update src/ray/object_manager/plasma/common.h

3ba379d

Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>

pcmoritz reviewed Feb 20, 2025

dentiny reviewed Feb 21, 2025

MengjinYan added 5 commits February 20, 2025 17:22

Merge remote-tracking branch 'upstream/master' into core-730

19487f7

fix review comments

697dcae

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>

merge conflict

7cb0dab

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>

fix typo

3843902

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>

fix typo

7c50d6d

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>

jjyao approved these changes Feb 21, 2025

MengjinYan added 2 commits February 21, 2025 10:38

fix review comments

d6005f5

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>

fix the uri

f7162b5

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>

jjyao requested a review from pcmoritz February 21, 2025 18:41

pcmoritz approved these changes Feb 21, 2025

dayshah approved these changes Feb 21, 2025

Merge branch 'master' into core-730

c592821

jjyao enabled auto-merge (squash) February 21, 2025 22:53

github-actions bot disabled auto-merge February 21, 2025 22:53

MengjinYan added 3 commits February 21, 2025 14:57

Merge remote-tracking branch 'upstream/master' into core-730

1d6352c

Merge branch 'core-730' of https://github.com/MengjinYan/ray into cor…

562ed14

…e-730

fix subject format

d338f31

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>

jjyao enabled auto-merge (squash) February 22, 2025 00:14

jjyao merged commit 2a85cef into ray-project:master Feb 22, 2025
6 checks passed
