
Conversation

@njhill (Member) commented Mar 31, 2024

There's no need for the parallel workers to be scheduled in every step.

Using 80GB A100s with llama-2-7b and the OpenAI completions API. Single request with 5 input tokens and 2000 generated tokens. I repeated each test request multiple times; the results were very consistent.

Configuration                                         Time (sec)   Difference
TP=1                                                  24.2         0
TP=2                                                  24.2         0
TP=2 with this PR                                     20.2         -17%
TP=2 with #3466 (without Ray)                         17.0         -30%
TP=2 with this PR combined with #3466 (without Ray)   16.7         -31%

Though the relative improvement from this is much smaller in the non-Ray case, it might still be helpful for multi-node with Ray.
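To make the idea concrete, here is a minimal sketch of the pattern being described (purely illustrative, with hypothetical names, and assuming a CPU-capable torch.distributed group for the control-plane broadcast; this is not the PR's actual code): the remote workers block inside an execution loop and receive each step's inputs via a broadcast from the driver, instead of being scheduled as a new task every step, and a None sentinel releases them from the loop.

```python
# Illustrative sketch only (hypothetical names, not vLLM's actual code).
# Assumes torch.distributed is initialized, rank 0 is the driver, and the
# group used here supports object broadcasts (e.g. a gloo/CPU group).
from typing import Any, Optional

import torch.distributed as dist


def worker_execution_loop(model_runner: Any, group: Optional[Any] = None) -> None:
    """Remote worker: block on a broadcast each step instead of waiting to be
    scheduled as a new task; exit when the driver broadcasts None."""
    while True:
        payload = [None]
        dist.broadcast_object_list(payload, src=0, group=group)
        step_inputs = payload[0]
        if step_inputs is None:  # stop sentinel from the driver
            return
        model_runner.execute_model(step_inputs)


def driver_step(step_inputs: Any, model_runner: Any, group: Optional[Any] = None) -> Any:
    """Driver: broadcast this step's inputs to the waiting workers, then run
    its own shard locally. No per-step task scheduling is involved."""
    dist.broadcast_object_list([step_inputs], src=0, group=group)
    return model_runner.execute_model(step_inputs)


def stop_workers(group: Optional[Any] = None) -> None:
    """Driver: release the workers from their loop (e.g. when no requests
    remain) by broadcasting the stop sentinel."""
    dist.broadcast_object_list([None], src=0, group=group)
```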

@njhill changed the title from "[Core] Eliminate parallel worker inter-token scheduling overhead" to "[Core] Eliminate parallel worker inter-token task scheduling overhead" on Apr 1, 2024
@njhill changed the title from "[Core] Eliminate parallel worker inter-token task scheduling overhead" to "[Core] Eliminate parallel worker per-step task scheduling overhead" on Apr 1, 2024
@njhill (Member, Author) commented Apr 1, 2024

I measured about a 5% reduction in latency with this change for a single request with 5 input tokens and 1000 output tokens, with TP=4 llama-2-70b.

@WoosukKwon (Collaborator) commented:

@zhuohan123 Kindly ping for PR review.

@Yard1 (Collaborator) left a comment:

I think this looks pretty good!

@zhuohan123 (Member) left a comment:

Thanks for the contribution! I left some small comments on coding style.

One additional question: do you think it's possible to let the workers stay in the model execution loop forever (so we don't need to call the current halt_model function)?

Review comment from a Member:

A slightly more accurate name:

Suggested change:
- async_remote_only: bool = False,
+ remote_worker_only: bool = False,

Reply from @njhill (Member, Author):

@zhuohan123 I've renamed it for now to remote_workers_only_async because, in addition to only running on the remote workers, it also changes the behavior to not wait for the responses and to return the future(s) rather than the resulting outputs.

Let me know if you would still prefer to drop the _async though (or if you can think of a better name... maybe start_remote_workers_only?) and I'll update again.
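To make the naming discussion concrete, here is a hypothetical sketch of what a flag like this could look like on a Ray-based _run_workers helper (illustrative only; the real vLLM signature, class, and surrounding code may differ):

```python
# Hypothetical sketch of a _run_workers-style helper; not the actual vLLM code.
from typing import Any

import ray


class RayExecutorSketch:
    """Assumes self.workers are Ray actors exposing an execute_method(name, ...)
    entry point, and self.driver_worker is a local (non-Ray) worker object."""

    def _run_workers(self, method: str, *args: Any,
                     remote_workers_only_async: bool = False,
                     **kwargs: Any) -> Any:
        # Each remote call returns a Ray ObjectRef (a future) immediately.
        remote_futures = [
            worker.execute_method.remote(method, *args, **kwargs)
            for worker in self.workers
        ]
        if remote_workers_only_async:
            # Only the remote workers run the method, and we don't wait:
            # hand the futures back to the caller to ray.get() later.
            return remote_futures

        # Default path: also run on the local driver worker, then block on
        # the remote results and return everything together.
        driver_output = getattr(self.driver_worker, method)(*args, **kwargs)
        return [driver_output] + ray.get(remote_futures)
```

A name like start_remote_workers_only would work with the same body; the flag's job is just to skip the driver worker and skip the blocking ray.get().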

Review comment from a Member:

Possibly a more straightforward name:

Suggested change:
- def halt_model(self) -> None:
+ def stop_remote_worker_execution_loop(self) -> None:

Reply from @njhill (Member, Author):

I've made this change, but the reason for the original name is that it's defined in the abstract base class where there is no concept of a remote worker. "halt_model" seemed like a more abstract way of describing it.

Review comment from a Member:

Suggested change:
- def execute_model_parallel(self) -> None:
+ def start_worker_execution_loop(self) -> None:

And we can rename the original execute_model -> driver_execute_model

Reply from @njhill (Member, Author):

@zhuohan123 I didn't make this change yet because it's actually called from a lot of places, including a number of places in the speculative decoding logic. Please confirm and I can change it everywhere.
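For what it's worth, here is a rough, hypothetical sketch of how the renamed pieces could fit together at the executor level (the names follow the suggestions above, but the structure is illustrative rather than the PR's actual implementation): the worker loop is started lazily on the first step, driver_execute_model runs each step on the driver only, and stopping the loop waits on the worker tasks so any errors still propagate.

```python
# Hypothetical executor-level sketch; names mirror the ones discussed above,
# but the structure is illustrative, not the PR's actual implementation.
from typing import Any, List, Optional


class DistributedExecutorSketch:
    def __init__(self) -> None:
        # Futures for the remote workers' execution-loop tasks, if running.
        self.parallel_worker_tasks: Optional[List[Any]] = None

    def execute_model(self, step_inputs: Any) -> Any:
        if self.parallel_worker_tasks is None:
            # First step of a batch: put the remote workers into their
            # execution loop once, instead of scheduling them every step.
            self.parallel_worker_tasks = self._run_workers(
                "start_worker_execution_loop",
                remote_workers_only_async=True)
        # Per-step work runs on the driver only; workers pick up the inputs
        # via the broadcast inside their loop.
        return self.driver_execute_model(step_inputs)

    def stop_remote_worker_execution_loop(self) -> None:
        if self.parallel_worker_tasks is None:
            return
        # Have the driver broadcast the stop sentinel, then wait for the
        # worker-loop tasks so any errors are propagated to the caller.
        self.driver_execute_model(None)
        self._wait_for_tasks(self.parallel_worker_tasks)
        self.parallel_worker_tasks = None

    # The helpers below are assumed to exist elsewhere in this sketch.
    def _run_workers(self, method: str, **kwargs: Any) -> List[Any]: ...
    def driver_execute_model(self, step_inputs: Optional[Any]) -> Any: ...
    def _wait_for_tasks(self, tasks: List[Any]) -> None: ...
```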

Review comment:

Will this return finished outputs twice? Will the LLM object get duplicate outputs for the same request?

Reply from @njhill (Member, Author):

@hengxinCheung sorry, I'm not sure I understand the question. This PR doesn't change anything w.r.t. how many outputs are returned.

@Hanson-zh commented Apr 9, 2024:

I am sorry for confusing you. Let me provide a more detailed description. For example, suppose request A is marked as finished in the current execution, but it is still scheduled in the next step. Will this request then return its last generated text twice? I will carefully read your implementation again. Thanks for your reply.

@njhill (Member, Author) commented Apr 9, 2024

Thanks for the review @zhuohan123, and great comments. I have addressed most of them but have a couple of small follow-up questions, PTAL!

One additional question, do you think it's possible to let the workers be in the model loop forever (so we don't need to call the current halt_model function)?

I'd been considering this, but I'm not sure that NCCL is intended to be used in this way, i.e. blocking indefinitely in an event loop. So we could run into unexpected issues. Apart from this, there are some consequences we'd have to address:

  • NCCL has a timeout, but unless you run in a sub-optimal synchronous mode, it's always treated as fatal. We could possibly disable the timeout, but that might be undesirable. Or we could ensure that the driver wakes up the workers on some interval shorter than the timeout (see the sketch below).
  • I think the processes are essentially frozen while in the collective op, so no other threads can run. This might be due to the GIL being held. So further changes might then be needed to accommodate other control-plane operations, in particular adding/removing LoRAs. If we went this route, we may also want to handle these via the same broadcast + gather mechanism.

Given the above, I thought the current PR changes would make more sense as a first incremental change. But I do like the idea of avoiding the secondary RPC path altogether. Perhaps gloo could instead be used for the event loop, along the lines of what @youkaichao has been looking at.
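As a purely illustrative aside on the timeout bullet above: in torch.distributed the collective timeout is fixed when the process group is created, so the options mentioned (relaxing the timeout, or keeping the driver's wake-up interval below it) are configuration-level choices. A minimal sketch, assuming the usual env:// rendezvous variables are set:

```python
# Illustrative only: the NCCL process-group timeout is configurable at
# initialization time (MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE env vars
# are assumed for the default env:// rendezvous).
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    # Make the watchdog timeout comfortably longer than any expected idle
    # period in the worker execution loop (the value here is arbitrary).
    timeout=timedelta(hours=2),
)
```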

@youkaichao (Member) commented:

@njhill FYI #3904 just adds a gloo backend by default. That process group is available as _CPU_WORLD_GROUP. I will create some APIs for getting the group.
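For readers unfamiliar with that mechanism, here is a generic torch.distributed illustration (not vLLM's actual _CPU_WORLD_GROUP code from #3904) of keeping a gloo-backed CPU group alongside the default NCCL group, so control-plane objects can be broadcast without going through NCCL collectives:

```python
# Generic illustration only; vLLM's actual _CPU_WORLD_GROUP setup may differ.
# Assumes env:// rendezvous variables are set for init_process_group.
import torch.distributed as dist


def init_groups(world_size: int, rank: int):
    # Default group with NCCL for the GPU tensor collectives.
    dist.init_process_group(backend="nccl", world_size=world_size, rank=rank)
    # Additional gloo-backed group covering all ranks, usable for CPU-side
    # control messages (e.g. broadcast_object_list of per-step metadata).
    cpu_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
    return cpu_group
```

Object broadcasts such as dist.broadcast_object_list(..., group=cpu_group) can then target the gloo group directly.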

njhill added 5 commits April 15, 2024 13:44
There's no need for the parallel workers to be scheduled each step.
So that any errors are still propagated properly
Default behaviour is no-op (single GPU)
Default behaviour is no-op (single GPU)
@njhill (Member, Author) commented May 18, 2024

@zhuohan123 I've opened #4894 to replace this; it now applies to both the Ray and multiprocessing executor implementations. PTAL!

@njhill closed this on May 18, 2024
