-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Remove multiple core workers in one process 1/n. #24147
Conversation
Hey @jovany-wang do you plan to work on this? |
@rkooo567 Yes. I'm now on vacation, and will be back tomorrow. |
2828d23
to
264bf5a
Compare
NOTE: After the last version of 1.x being cut, we could remove |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More APIs to be removed:
- Ray.getAsyncContext
- Ray.setAsyncContext
- Ray.wrapRunnable`
- Ray.wrapCallable
Curious to me that you mentioned you want to remove APIs but you didn't touch the "java/api" directory.
And some implementation code in case you didn't notice (it's OK to remove them in a later PR):
- RayAsyncContextUpdater
- ClassLoaderTest
- MultiThreadingTest
- nativeSetCoreWorker
- CoreWorkerProcess::SetCurrentThreadWorkerId
- thread_local_core_worker_
- Search
number of workers
in core_worker_process.h and update comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clicked review twice. Ignore this comment.
1.13 is already cut (next release will be 2.0) so we should be able to merge it. |
Added. |
@@ -34,7 +33,7 @@ | |||
import org.slf4j.Logger; | |||
import org.slf4j.LoggerFactory; | |||
|
|||
/** Manages functions by job id. */ | |||
/** Manages functions globally. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/** Manages functions globally. */ | |
/** Manages functions in the current worker. */ |
globally sounds ambiguous.
workers_.erase(worker->GetWorkerID()); | ||
RAY_LOG(INFO) << "Removed worker " << worker->GetWorkerID(); | ||
} | ||
RAY_LOG(INFO) << "Removed worker " << worker->GetWorkerID(); | ||
if (global_worker_ == worker) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, is there case global_worker != worker?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No case that globalworker != worker. I'm going to refactor core worker process in the next PR.
int64_t data_size; | ||
int64_t metadata_size; | ||
int64_t data_size = 0; | ||
int64_t metadata_size = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These initializations are used to avoid the failure from UBSAN, the error message is:
[ RUN ] ObjectLifecycleManagerTest.RemoveReferenceTwoRef
--
| bazel-out/k8-opt/bin/_virtual_includes/plasma_store_server_lib/ray/object_manager/common.h:53:54: runtime error: signed integer overflow: 5921777712514757477 + 8315161630257148773 cannot be represented in type 'long int'
The cause is these 2 fields are not initialized and then the result of data_size + metadata_size
is overflow.
Why it doesn't happen on master?
Because this PR may change the memory view of the process. I tested on master, data_size
and metadata_size
are also big numbers but the sum of them is a little smaller than the max value of int64_t
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch. cc @scv119 can you check setting these to 0 & 0 is reasonable?
…ay-project#33074) Multiple workers in one process is removed in ray-project#24147 but there is still some dead code Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: Jack He <jackhe2345@gmail.com>
…ay-project#33074) Multiple workers in one process is removed in ray-project#24147 but there is still some dead code Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…ay-project#33074) Multiple workers in one process is removed in ray-project#24147 but there is still some dead code Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
…ay-project#33074) Multiple workers in one process is removed in ray-project#24147 but there is still some dead code Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: elliottower <elliot@elliottower.com>
…ay-project#33074) Multiple workers in one process is removed in ray-project#24147 but there is still some dead code Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: Jack He <jackhe2345@gmail.com>
Why are these changes needed?
This is the 1st PR to remove the code path of multiple core workers in one process. This PR is aiming to remove the flags and APIs related to
num_workers
.After this PR checking in, we needn't to consider the multiple core workers any longer.
The further following PRs are related to the deeper logic refactor, like eliminating the gap between core worker and core worker process, removing the logic related to multiple workers from workerpool, gcs and etc.
BREAK CHANGE
This PR removes these APIs:
And the following APIs are not allowed to invoke in a user-created thread in local mode:
Note that this PR shouldn't be merged to 1.x.
Related issue number
#21547 #24567
Checks
scripts/format.sh
to lint the changes in this PR.