Update Server's README with undocumented options for RoPE, YaRN, and KV cache quantization #7013

Merged: 2 commits, May 7, 2024

K-Mistele
Contributor

I recently updated llama.cpp and found that a number of server CLI options are not described in the README, including those for RoPE, YaRN, KV cache quantization, and flash attention.

This PR updates the server README to include these options and their possible values. In a future PR (not this one), I will probably add a guide to RoPE/YaRN scaling, since configuring these parameters is non-obvious and currently requires digging through older issues to work out what reasonable values look like.

- `--yarn-beta-fast N`: YaRN: low correction dim or beta (default: 32.0)
- `--pooling` : Pooling type for embeddings, use model default if unspecified. Options are `none`, `mean`, `cls`
- `-dt N`, `--defrag-thold N`: KV cache defragmentation threshold (default: -1.0, < 0 = disabled)
- `-fa`, `--flash-attn` : enable flash attention (default: disabled).
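
To make the entries above concrete, here is a minimal sketch of how they might be combined on a command line. It is illustrative only: the `./server` binary path, the `-m` model flag, the model filename, and the `0.1` threshold are assumptions rather than part of this diff, and `build_server_args` is a hypothetical helper, not something llama.cpp provides.

```python
import shlex

# Pooling values listed in the README entry above.
POOLING_TYPES = ("none", "mean", "cls")


def build_server_args(model_path, *, yarn_beta_fast=None, pooling=None,
                      defrag_thold=None, flash_attn=False, binary="./server"):
    """Assemble an argv list from the options documented above.

    `binary` and `model_path` are placeholders (assumptions). Options left
    as None are omitted so the server falls back to its own defaults.
    """
    if pooling is not None and pooling not in POOLING_TYPES:
        raise ValueError(f"pooling must be one of {POOLING_TYPES}, got {pooling!r}")

    args = [binary, "-m", model_path]
    if yarn_beta_fast is not None:
        args += ["--yarn-beta-fast", str(yarn_beta_fast)]
    if pooling is not None:
        args += ["--pooling", pooling]
    if defrag_thold is not None:
        args += ["--defrag-thold", str(defrag_thold)]
    if flash_attn:
        # Boolean switch: takes no value.
        args.append("--flash-attn")
    return args


# Hypothetical invocation: mean pooling, defragmentation enabled, flash attention on.
cmd = build_server_args("models/example.gguf", pooling="mean",
                        defrag_thold=0.1, flash_attn=True)
print(shlex.join(cmd))
```

Leaving an option unset keeps the server's own default, which matters for `--defrag-thold`, where the documented default of -1.0 means defragmentation is disabled.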
Collaborator

Should we consider adding this to the hot topics section, @ggerganov?

Contributor Author

Probably worth noting that flash attention support is somewhat experimental and may not work on all platforms.

Also, I have been meaning to put together a guide on RoPE / YaRN / grouped attention (self-extend) configuration for different models and length-extension factors, but haven't gotten to it yet.

@K-Mistele
Contributor Author

bump

Contributor

Jeximo left a comment

There seems to be trailing whitespace that is failing the editorconfig check.

ggerganov merged commit 260b7c6 into ggerganov:master on May 7, 2024
23 of 24 checks passed
4 participants