add separate prefill detokenization thread #152
Conversation
Force-pushed from 676a90e to 2088c2c
Overall LGTM, just some clarification questions/suggestions.
Force-pushed from 2088c2c to 427e771
Thank you, Pate!

:-/

I will make sure the tests pass before submission. Sorry for the mistake.
Force-pushed from 427e771 to 9b5cdfb
Looks good. What improvement do you see with this change?

It's in the analysis doc.
Hi Zhihao, Pate, and Vipan, I am afraid that this PR may disrupt the correct detokenization order for a given request. Please let me know if my understanding is incorrect.

The proposed change creates separate detokenization backlogs: P backlogs for prefill engines and G backlogs for generation engines. For each active request, the first token (from the prefill thread) and subsequent tokens (from the generate thread) end up in different backlogs. This separation can lead to incorrect ordering: to maintain proper token order, all tokens from a single request should pass through the same backlog before being returned via the request's channel.

Consider this edge case: if Python's thread scheduler never executes the detokenization threads that read from the prefill-detokenization backlogs, only the tokens from the generate threads would be returned, so the first token of every request would be missing.

cc @changlan
Thank you, Zhihao, for the clarification! It is very helpful!

How about merging the P prefill detokenization backlogs and the G generation detokenization backlogs into a single set of N detokenization backlogs, each served by its own detokenization thread? For any active request r, any prefill or generation thread processing r simply adds the output token IDs to the i-th of these N detokenization backlogs, where i is calculated as the hash of r modulo N. This ensures that all response tokens for r go to the same detokenization backlog and therefore arrive in the correct order. See the sketch below.
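A minimal sketch of the hashed-backlog idea; the queue layout, `request_id`, and function names here are hypothetical, not JetStream's actual API:

```python
import queue
import threading

N = 4  # number of shared detokenization backlogs/threads (hypothetical)

backlogs = [queue.Queue() for _ in range(N)]

def backlog_for(request_id: str) -> queue.Queue:
    # hash() is stable within one process, so every token batch for a
    # given request lands on the same backlog and is decoded in FIFO order.
    return backlogs[hash(request_id) % N]

def emit_tokens(request_id: str, token_ids: list) -> None:
    # Called by both prefill and generate threads for the same request.
    backlog_for(request_id).put((request_id, token_ids))

def detokenize_worker(backlog: queue.Queue) -> None:
    while True:
        request_id, token_ids = backlog.get()
        # ... detokenize token_ids and push text to the request's channel

for b in backlogs:
    threading.Thread(target=detokenize_worker, args=(b,), daemon=True).start()
```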
Thank you for the quick response and suggestion, Yi! This is definitely a great idea! I will try it.

Currently, I have observed that the JetStream runtime occasionally gets stuck for several seconds and then returns to normal (when I run disaggregated serving with 1 prefill engine and 1 generate/decode engine). I ran py-spy and saw that during the stall the GIL is held by one thread that does not seem to be doing anything. I am just afraid that introducing more threads will make it worse, as N seems to scale with the QPS throughput. Maybe the above option #2, running detokenization in a process instead of a thread, is better. I will evaluate both; a rough sketch of the process-based option follows.
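A rough sketch of the process-based alternative mentioned above, again with hypothetical names: a `multiprocessing` worker drains the backlog outside the serving process, so detokenization does not compete for the server's GIL:

```python
import multiprocessing as mp

def detokenize_process(backlog, results):
    # Runs in a separate process with its own GIL.
    while True:
        request_id, token_ids = backlog.get()
        # ... tokenizer.decode(token_ids) would run here
        results.put((request_id, f"<decoded {len(token_ids)} tokens>"))

if __name__ == "__main__":
    backlog, results = mp.Queue(), mp.Queue()
    worker = mp.Process(target=detokenize_process,
                        args=(backlog, results), daemon=True)
    worker.start()
    backlog.put(("req-0", [1, 2, 3]))
    print(results.get())  # ('req-0', '<decoded 3 tokens>')
```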
Thank you, Zhihao, for the context and the quick reaction! Looking forward to your evaluation results!
Hi Zhihao and Vipan, just following up on this issue: do we have a conclusion, or has the change been reverted already? It is blocking our update due to potentially incorrect results. Thanks!
@sixiang-google, can you take a look? |
Add a separate prefill detokenization thread so that prefill and decode detokenization do not block each other. A minimal sketch of the intended structure appears below.
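A minimal sketch of the structure this PR describes, with hypothetical names; note the review discussion above, which points out that a request's first token and later tokens now flow through different backlogs:

```python
import queue
import threading

# Separate backlogs: prefill first tokens and generate tokens are
# drained by independent threads, so neither blocks the other.
prefill_backlog = queue.Queue()
generate_backlog = queue.Queue()

def detokenize_loop(backlog: queue.Queue, name: str) -> None:
    while True:
        request_id, token_ids = backlog.get()
        # ... decode and return tokens on the request's channel

threading.Thread(target=detokenize_loop,
                 args=(prefill_backlog, "prefill"), daemon=True).start()
threading.Thread(target=detokenize_loop,
                 args=(generate_backlog, "generate"), daemon=True).start()
```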