Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Remove multiple core workers in one process 1/n. #24147

Merged
merged 27 commits into from
May 18, 2022

Conversation

jovany-wang
Copy link
Contributor

@jovany-wang jovany-wang commented Apr 24, 2022

Why are these changes needed?

This is the 1st PR to remove the code path of multiple core workers in one process. This PR is aiming to remove the flags and APIs related to num_workers.
After this PR checking in, we needn't to consider the multiple core workers any longer.

The further following PRs are related to the deeper logic refactor, like eliminating the gap between core worker and core worker process, removing the logic related to multiple workers from workerpool, gcs and etc.

BREAK CHANGE
This PR removes these APIs:

  • Ray.wrapRunnable();
  • Ray.wrapCallable();
  • Ray.setAsyncContext();
  • Ray.getAsyncContext();

And the following APIs are not allowed to invoke in a user-created thread in local mode:

  • Ray.getRuntimeContext().getCurrentActorId();
  • Ray.getRuntimeContext().getCurrentTaskId()

Note that this PR shouldn't be merged to 1.x.

Related issue number

#21547 #24567

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rkooo567
Copy link
Contributor

rkooo567 commented May 3, 2022

Hey @jovany-wang do you plan to work on this?

@jovany-wang
Copy link
Contributor Author

@rkooo567 Yes. I'm now on vacation, and will be back tomorrow.

@jovany-wang jovany-wang added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 4, 2022
@jovany-wang jovany-wang force-pushed the arch-cleaning branch 2 times, most recently from 2828d23 to 264bf5a Compare May 6, 2022 08:44
@jovany-wang jovany-wang changed the title [WIP] Remove multi core workers in one process. [Core] Remove multiple core workers in one process 1/n. May 6, 2022
@jovany-wang jovany-wang added do-not-merge Do not merge this PR! and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels May 6, 2022
@jovany-wang
Copy link
Contributor Author

NOTE: After the last version of 1.x being cut, we could remove don't-merge label.

@jovany-wang
Copy link
Contributor Author

@kfstorm @rkooo567 This is ready for review now.

Copy link
Member

@kfstorm kfstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More APIs to be removed:

  • Ray.getAsyncContext
  • Ray.setAsyncContext
  • Ray.wrapRunnable`
  • Ray.wrapCallable

Curious to me that you mentioned you want to remove APIs but you didn't touch the "java/api" directory.

And some implementation code in case you didn't notice (it's OK to remove them in a later PR):

  • RayAsyncContextUpdater
  • ClassLoaderTest
  • MultiThreadingTest
  • nativeSetCoreWorker
  • CoreWorkerProcess::SetCurrentThreadWorkerId
  • thread_local_core_worker_
  • Search number of workers in core_worker_process.h and update comments.

src/ray/core_worker/core_worker.cc Outdated Show resolved Hide resolved
src/ray/raylet/worker_pool.cc Outdated Show resolved Hide resolved
Copy link
Member

@kfstorm kfstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clicked review twice. Ignore this comment.

@jjyao
Copy link
Collaborator

jjyao commented May 6, 2022

1.13 is already cut (next release will be 2.0) so we should be able to merge it.

@jovany-wang jovany-wang added Ray 2.0 and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels May 11, 2022
@jovany-wang
Copy link
Contributor Author

It will be also great if you can write what APIs are removed from the PR description (for the future reference)!

Added.

@@ -34,7 +33,7 @@
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Manages functions by job id. */
/** Manages functions globally. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/** Manages functions globally. */
/** Manages functions in the current worker. */

globally sounds ambiguous.

workers_.erase(worker->GetWorkerID());
RAY_LOG(INFO) << "Removed worker " << worker->GetWorkerID();
}
RAY_LOG(INFO) << "Removed worker " << worker->GetWorkerID();
if (global_worker_ == worker) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, is there case global_worker != worker?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No case that globalworker != worker. I'm going to refactor core worker process in the next PR.

int64_t data_size;
int64_t metadata_size;
int64_t data_size = 0;
int64_t metadata_size = 0;
Copy link
Contributor Author

@jovany-wang jovany-wang May 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These initializations are used to avoid the failure from UBSAN, the error message is:

[ RUN      ] ObjectLifecycleManagerTest.RemoveReferenceTwoRef
--
  | bazel-out/k8-opt/bin/_virtual_includes/plasma_store_server_lib/ray/object_manager/common.h:53:54: runtime error: signed integer overflow: 5921777712514757477 + 8315161630257148773 cannot be represented in type 'long int'

The cause is these 2 fields are not initialized and then the result of data_size + metadata_size is overflow.

Why it doesn't happen on master?
Because this PR may change the memory view of the process. I tested on master, data_size and metadata_size are also big numbers but the sum of them is a little smaller than the max value of int64_t.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. cc @scv119 can you check setting these to 0 & 0 is reasonable?

@jovany-wang jovany-wang merged commit eb29895 into ray-project:master May 18, 2022
@jovany-wang jovany-wang deleted the arch-cleaning branch May 18, 2022 16:36
jjyao added a commit that referenced this pull request Mar 7, 2023
…33074)

Multiple workers in one process is removed in #24147 but there is still some dead code

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request Mar 21, 2023
…ay-project#33074)

Multiple workers in one process is removed in ray-project#24147 but there is still some dead code

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jack He <jackhe2345@gmail.com>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…ay-project#33074)

Multiple workers in one process is removed in ray-project#24147 but there is still some dead code

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request Mar 22, 2023
…ay-project#33074)

Multiple workers in one process is removed in ray-project#24147 but there is still some dead code

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
…ay-project#33074)

Multiple workers in one process is removed in ray-project#24147 but there is still some dead code

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: elliottower <elliot@elliottower.com>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
…ay-project#33074)

Multiple workers in one process is removed in ray-project#24147 but there is still some dead code

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jack He <jackhe2345@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants