using bigdl-llm fused rope for llama #9066

Merged 10 commits into intel-analytics:main on Oct 6, 2023

@yangw1234 yangw1234 commented Sep 27, 2023

Description

Use the bigdl-llm fused RoPE kernel for Llama to reduce generation latency.

@yangw1234 yangw1234 changed the title [WIP] fused rope [WIP] fused rope and rmsnorm Oct 3, 2023
q_embed = torch.empty(query_states.shape, dtype=query_states.dtype, device=query_states.device)
k_embed = torch.empty(key_states.shape, dtype=key_states.dtype, device=key_states.device)

linear_q4_0.apply_rotary_embedding_half_qk(query_states, key_states, position_ids, q_embed, k_embed)
Contributor

this only works on GPU, yes?

Contributor Author

Yes, only on GPU.

Contributor Author

I think cpu ipex has a similar function.

Contributor

> I think cpu ipex has a similar function.

We can add our GPU optimizations first
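
For readers following the thread, the unfused computation that a fused kernel such as `apply_rotary_embedding_half_qk` stands in for can be sketched in plain Python. This is only an illustration under stated assumptions: it assumes "half" refers to the rotate-half RoPE convention (pairing element `i` with element `i + head_dim / 2`), it assumes the common frequency base of 10000, and the helper names `rope_rotate_half` and `rope_qk` are hypothetical. It is not the bigdl-llm kernel and ignores batching, heads, and dtype handling.

```python
import math


def rope_rotate_half(x, position, base=10000.0):
    """Apply rotate-half RoPE to one head vector `x` (list of floats).

    Assumption: element i is paired with element i + len(x) // 2, and the
    pair is rotated by the angle position * base ** (-2 * i / len(x)).
    """
    dim = len(x)
    if dim % 2 != 0:
        raise ValueError("head dimension must be even")
    half = dim // 2
    out = [0.0] * dim
    for i in range(half):
        theta = position * base ** (-2.0 * i / dim)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        # Standard 2-D rotation of the pair (a, b) by theta.
        out[i] = a * c - b * s
        out[i + half] = b * c + a * s
    return out


def rope_qk(query, key, position_ids, base=10000.0):
    """Rotate per-token query and key vectors ([seq_len][head_dim] lists)."""
    q_embed = [rope_rotate_half(q, p, base) for q, p in zip(query, position_ids)]
    k_embed = [rope_rotate_half(k, p, base) for k, p in zip(key, position_ids)]
    return q_embed, k_embed
```

A fused kernel would produce `q_embed` and `k_embed` in one launch instead of several elementwise operations, which is presumably where the latency reduction mentioned in the description comes from.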

@yangw1234
Contributor Author

I think I am going to separate rms_norm, rope and other changes into different PRs with rms_norm being the first.

@yangw1234 yangw1234 changed the title [WIP] fused rope and rmsnorm using bigdl-llm fused rope for llama Oct 5, 2023
@yangw1234
Contributor Author

Performance results are updated here: https://github.com/analytics-zoo/nano/issues/606

@jason-dai would you mind reviewing again?

@yangw1234
Contributor Author

These optimizations do not work for training.

Should we check for training mode everywhere they are applied?
Or can we assume that users will not set optimize_model=True when training?

@jason-dai

@yangw1234
Contributor Author

> These optimizations do not work for training.
>
> Should we check for training mode everywhere they are applied? Or can we assume that users will not set optimize_model=True when training?
>
> @jason-dai

Added a training-mode check for now.
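
The kind of guard discussed in this comment might look roughly like the sketch below. It is a hypothetical illustration, not the code merged in this PR: the names `can_use_fused_rope` and `apply_rope` are invented here, the `"xpu"` device string is assumed from the "optimize llama xpu rope" commit message, and the fused and fallback implementations are passed in as plain callables so the example stays self-contained.

```python
def can_use_fused_rope(device_type: str, training: bool) -> bool:
    """Return True only when the fused path is expected to be safe.

    Assumptions: the fused kernel exists only for the "xpu" device type,
    and it must be skipped whenever the module is in training mode.
    """
    return device_type == "xpu" and not training


def apply_rope(q, k, position_ids, device_type, training, fused_fn, fallback_fn):
    """Dispatch to the fused kernel when allowed, else to the fallback."""
    if can_use_fused_rope(device_type, training):
        return fused_fn(q, k, position_ids)
    return fallback_fn(q, k, position_ids)
```

Keeping the decision in one helper means each optimized call site can reuse the same condition, which addresses the "check in every place" concern without duplicating the logic.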

Contributor

@jason-dai jason-dai left a comment

LGTM

@yangw1234 yangw1234 merged commit 1739372 into intel-analytics:main Oct 6, 2023
16 checks passed
liu-shaojun pushed a commit that referenced this pull request Mar 25, 2024
* optimize llama xpu rope

* fix bug

* fix style

* refine append cache

* remove check

* do not cache cos sin

* remove unnecessary changes

* clean up

* fix style

* check for training