This repository has been archived by the owner on Oct 25, 2024. It is now read-only.

[LLM Runtime] Enable Mistral-7b #552

Merged
merged 3 commits into main from mistral_graph on Oct 26, 2023

Conversation

intellinjun (Contributor) commented Oct 26, 2023

Type of Change

feature or bug fix or documentation or others
API changed or not: not
huggingface models:

Description

detail description
JIRA ticket: 917
TODO

  • n_head_kv
  • extension tests

Expected Behavior & Potential Risk

the expected behavior triggered by this PR

How has this PR been tested?

how to reproduce the test (including hardware information)

Dependency Change?

any library dependency introduced or removed

Signed-off-by: intellinjun <jun.lin@intel.com>
Signed-off-by: intellinjun <jun.lin@intel.com>
intellinjun (Contributor, Author) commented:

[image attachment]

intellinjun requested a review from a32543254 on October 26, 2023 02:29
Signed-off-by: intellinjun <105184542+intellinjun@users.noreply.github.com>
zhenwei-intel (Contributor) commented:

Do we support sliding window attention (SWA)?
https://mistral.ai/news/announcing-mistral-7b/

hshen14 (Contributor) commented Oct 26, 2023

> Do we support sliding window attention (SWA)? https://mistral.ai/news/announcing-mistral-7b/

Talked with Jun: SWA is not supported yet, since the code reuses llama.cpp, which has no SWA support. We can enable StreamingLLM on this model through a separate PR.
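
For anyone unfamiliar with SWA, here is a minimal numpy sketch of how a sliding-window mask differs from the full causal mask (toy window of 3 tokens; Mistral-7B's published window is 4096). It only illustrates the masking pattern and is not the runtime's implementation:

```python
import numpy as np

def causal_mask(seq_len):
    # Full causal attention: token i may attend to every token j <= i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def sliding_window_mask(seq_len, window_size):
    # Sliding-window attention: token i may attend only to the most recent
    # `window_size` tokens, i.e. j in (i - window_size, i].
    mask = causal_mask(seq_len)
    for i in range(seq_len):
        mask[i, : max(0, i - window_size + 1)] = False
    return mask

print(sliding_window_mask(6, 3).astype(int))
```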

hshen14 merged commit 7d14956 into main on Oct 26, 2023
hshen14 deleted the mistral_graph branch on October 26, 2023 04:58
zhenwei-intel (Contributor) commented:

> > Do we support sliding window attention (SWA)? https://mistral.ai/news/announcing-mistral-7b/
>
> Talked with Jun: SWA is not supported yet, since the code reuses llama.cpp, which has no SWA support. We can enable StreamingLLM on this model through a separate PR.

The StreamingLLM function is already available; just specify n_keep=4 and n_discard=-1.
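
A rough usage sketch, assuming the runtime's Hugging Face-style Python loader and that n_keep/n_discard are accepted as generate() keyword arguments as described above; the model id, prompt, and load_in_4bit path are placeholders, so check the runtime docs for the exact API:

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "mistralai/Mistral-7B-v0.1"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids

# Assumption: load_in_4bit routes generation through the C++ LLM Runtime.
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

# StreamingLLM-style cache handling: keep the first 4 "attention sink" tokens
# (n_keep=4) and let the runtime discard older KV-cache entries (n_discard=-1)
# once the context is full.
outputs = model.generate(inputs, max_new_tokens=300, n_keep=4, n_discard=-1)
print(tokenizer.decode(outputs[0]))
```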

intellinjun (Contributor, Author) commented:

[image attachment]
Without MHA fusion when using GCC 12; if using GCC 13, it will go through MHA fusion. (Mistral-7B uses grouped-query attention, the same as Llama-2-70B; this is not supported in llama.cpp now, and GQA fusion will be supported next week.)
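
As background on the GQA point, a small numpy sketch of how grouped-query attention shares key/value heads across groups of query heads (toy shapes; Mistral-7B itself uses 32 query heads over 8 KV heads). This is only a shape illustration, not the fused kernel being discussed:

```python
import numpy as np

n_head, n_head_kv, seq, d = 4, 2, 5, 8   # toy sizes
group = n_head // n_head_kv              # query heads sharing each KV head

q = np.random.randn(n_head, d)           # one query vector per query head
k = np.random.randn(n_head_kv, seq, d)   # KV cache holds only n_head_kv key heads

# Each query head h reads keys from its group's shared KV head (h // group).
scores = np.stack([q[h] @ k[h // group].T for h in range(n_head)])
print(scores.shape)  # (n_head, seq)
```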
