using bigdl-llm fused rope for llama #9066
Conversation
q_embed = torch.empty(query_states.shape, dtype=query_states.dtype, device=query_states.device)
k_embed = torch.empty(key_states.shape, dtype=key_states.dtype, device=key_states.device)

linear_q4_0.apply_rotary_embedding_half_qk(query_states, key_states, position_ids, q_embed, k_embed)
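For context on the fused call above: it writes the rotated query/key tensors into the pre-allocated q_embed/k_embed buffers in a single kernel launch, replacing the usual sequence of tensor ops. A minimal reference sketch of the unfused computation it stands in for, assuming the standard LLaMA half-rotation convention (suggested by "half_qk" in the op name) and cos/sin tables of shape [seq_len, head_dim], looks roughly like this:

import torch

def rotate_half(x):
    # Split the head dimension in half and rotate: (x1, x2) -> (-x2, x1).
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2:]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb_reference(q, k, cos, sin, position_ids):
    # q, k: [batch, num_heads, seq_len, head_dim]; cos, sin: [seq_len, head_dim];
    # position_ids: [batch, seq_len]. Gather per-position cos/sin and broadcast
    # over the head dimension.
    cos = cos[position_ids].unsqueeze(1)  # [batch, 1, seq_len, head_dim]
    sin = sin[position_ids].unsqueeze(1)  # [batch, 1, seq_len, head_dim]
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed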
This only works on GPU, yes?
Yes, only on GPU.
I think CPU IPEX has a similar function.
We can add our GPU optimizations first.
I think I am going to separate rms_norm, rope, and the other changes into different PRs, with rms_norm being the first.
Performance numbers are updated here: https://github.com/analytics-zoo/nano/issues/606. @jason-dai would you mind reviewing again?
These optimizations do not work for training. Should we check whether we are in training mode in every place?
Added the check for now.
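A hypothetical sketch of the kind of guard being discussed (the helper name is illustrative, not the actual code added in this PR): only take the fused path during inference on XPU, and fall back to the stock implementation whenever gradients might be needed.

def can_use_fused_rope(query_states, training):
    # The fused kernel has no backward pass, so skip it when training or when
    # autograd is tracking the tensors; it is also XPU-only.
    return (not training
            and not query_states.requires_grad
            and query_states.device.type == "xpu")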
LGTM
* optimize llama xpu rope
* fix bug
* fix style
* refine append cache
* remove check
* do not cache cos sin
* remove unnecessary changes
* clean up
* fix style
* check for training
Description
Use the BigDL-LLM fused RoPE kernel for LLaMA to reduce generation latency.
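Putting the pieces together, a hedged sketch of how a patched attention forward might dispatch between the fused XPU kernel and the unfused path. The guard and reference function are the illustrative helpers sketched earlier in this page; only the linear_q4_0 call itself is taken from the diff above.

import torch

def apply_rope(query_states, key_states, cos, sin, position_ids, training):
    if can_use_fused_rope(query_states, training):
        # Fused path: pre-allocate outputs and let the native op fill them in one pass.
        import linear_q4_0  # BigDL-LLM native op module, as used in the diff above
        q_embed = torch.empty_like(query_states)
        k_embed = torch.empty_like(key_states)
        linear_q4_0.apply_rotary_embedding_half_qk(
            query_states, key_states, position_ids, q_embed, k_embed)
        return q_embed, k_embed
    # Fallback: standard unfused rotary embedding (works on CPU and during training).
    return apply_rotary_pos_emb_reference(query_states, key_states, cos, sin, position_ids)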