
[Core] Multiprocessing executor for single-node multi-GPU [1/2] #4345

Closed
wants to merge 2 commits

Conversation

@njhill (Member) commented Apr 25, 2024

This introduces MultiProcGPUExecutor, which uses Python multiprocessing for tensor-parallel execution as an alternative to Ray.

This PR does not actually wire it up for use; that will be done in a follow-on PR.

This PR includes some refactoring to simplify the executor class hierarchy:

  • A MultiGPUExecutor abstract superclass shared between the Ray and vanilla multiprocessing implementations
  • Add a shutdown() method to the BaseExecutor abstract class; the executor is shut down when the LLMEngine is garbage collected
  • Simplify/centralize GPU Worker construction
  • Move ray_utils.py from the engine to the executor package (per @zhuohan123's suggestion)
  • Move function-call tracing setup to a utils function
  • Fix various typing issues

This replaces #3466, see background in that issue.
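The design described above — one worker process per rank, plus a shutdown() hook that runs when the owning engine is garbage collected — can be sketched roughly as follows. This is a hypothetical, simplified stand-in, not vLLM's actual MultiProcGPUExecutor: the names `MultiProcExecutorSketch` and `_worker_loop` are invented for illustration, and real CUDA workers would need the "spawn" start method rather than "fork".

```python
import multiprocessing as mp

def _worker_loop(rank: int, task_queue, result_queue) -> None:
    # Stand-in for a GPU worker: a real one would bind to device `rank`
    # and run a model shard; here we just double the input.
    while True:
        task = task_queue.get()
        if task is None:  # sentinel: shut down cleanly
            break
        result_queue.put((rank, task * 2))

class MultiProcExecutorSketch:
    """Toy analogue of a multiprocessing tensor-parallel executor."""

    def __init__(self, world_size: int) -> None:
        # "fork" keeps this sketch self-contained; CUDA code needs "spawn".
        ctx = mp.get_context("fork")
        self.task_queues = [ctx.Queue() for _ in range(world_size)]
        self.result_queue = ctx.Queue()
        self.procs = [
            ctx.Process(target=_worker_loop, args=(r, q, self.result_queue),
                        daemon=True)
            for r, q in enumerate(self.task_queues)
        ]
        self._closed = False
        for p in self.procs:
            p.start()

    def run(self, task):
        # Broadcast one task to every rank and collect one result per rank
        # (results arrive in arbitrary order).
        for q in self.task_queues:
            q.put(task)
        return [self.result_queue.get() for _ in self.procs]

    def shutdown(self) -> None:
        if self._closed:
            return
        self._closed = True
        for q in self.task_queues:
            q.put(None)
        for p in self.procs:
            p.join()

    def __del__(self):
        # Mirrors the PR's idea: clean up worker processes when the
        # owning object is garbage collected.
        try:
            self.shutdown()
        except Exception:
            pass
```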

Comment on lines +271 to +274
@lru_cache(maxsize=None)
def get_distributed_init_method() -> str:
ip = get_ip()
port = get_open_port()
Member

is it safe to cache here? what if we want to init distributed for two different groups, and we expect the function to return different results per call?
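The reviewer's concern can be demonstrated with a small stand-in (not vLLM's code): `_port_counter` below fakes `get_open_port()`, and the `@lru_cache` means every caller after the first receives the same cached address, so a second distributed group would try to reuse the first group's port.

```python
from functools import lru_cache
import itertools

# Stand-in for get_open_port(): hands out a fresh port on each call.
_port_counter = itertools.count(29500)

@lru_cache(maxsize=None)
def get_distributed_init_method() -> str:
    # Cached after the first call: later callers get the same string.
    return f"tcp://127.0.0.1:{next(_port_counter)}"

first = get_distributed_init_method()
second = get_distributed_init_method()
# `first` and `second` are identical, so two distinct process groups
# initialized via this function would collide on one address/port.
```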

vllm/utils.py Outdated
@@ -607,3 +614,15 @@ def find_nccl_library():
raise ValueError("NCCL only supports CUDA and ROCm backends.")
logger.info(f"Found nccl from library {so_file}")
return so_file


def enable_trace_function_call_for_process():
Member

actually this is per-thread. the naming is not accurate.
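A short sketch of the point being made, using plain standard-library calls (none of this is vLLM code): `sys.settrace` installs a trace function for the *current thread only*, so a helper named "…for_process" would not affect other threads unless `threading.settrace` were also called before they start.

```python
import sys
import threading

calls = []

def tracer(frame, event, arg):
    # Record which thread each traced Python call happens in.
    if event == "call":
        calls.append((threading.current_thread().name, frame.f_code.co_name))
    return None

def traced_work():
    pass

sys.settrace(tracer)   # affects only the thread that calls it
traced_work()          # recorded: main thread is being traced
sys.settrace(None)

t = threading.Thread(target=traced_work, name="worker")
t.start()
t.join()               # NOT recorded: the worker thread has no trace function
```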

@youkaichao (Member)

To be honest, I feel like this PR is still too large to be effectively reviewed. I suggest breaking it down into pieces, each PR having about 100–200 lines of change, so that we can have a quick feedback loop.

@njhill (Member, Author) commented Apr 25, 2024

@youkaichao here is the first one: #4347

@njhill (Member, Author) commented Apr 25, 2024

Next one, just introducing the new multi-gpu abstract executor class: #4348

@njhill (Member, Author) commented Apr 25, 2024

Next one, add shutdown method to ExecutorBase: #4349

@youkaichao (Member)

Let's keep it to 1–2 PRs per day to avoid overwhelming zhuohan and me?

@njhill (Member, Author) commented Apr 25, 2024

@youkaichao these are mostly the contents of the original PR that was already reviewed and discussed, just broken up (and mostly very small). Please feel free to review at whatever pace you're comfortable with!

Next one is just moving the function tracing setup to a util function: #4352

@njhill njhill marked this pull request as draft April 25, 2024 06:07
@njhill (Member, Author) commented Apr 25, 2024

Closing this in favor of a set of more granular PRs, some already opened (linked above).

@njhill njhill closed this Apr 25, 2024
@njhill njhill deleted the ray-optional3 branch May 15, 2024 22:46