Update Server's README with undocumented options for RoPE, YaRN, and KV cache quantization #7013

K-Mistele · 2024-04-30T21:10:24Z

I recently updated my LLama.cpp and found that there are a number of server CLI options which are not described in the README including for RoPE, YaRN, and KV cache quantization as well as flash attention.

This PR updates the README for the server to include these options as well as their possible values. Future work to be done here (not in this PR, but in a future one by me) will probably include a guide to using RoPE/YaRN scaling since the configuration of these parameters is non-obvious and requires going through some older issues to figure out what expected values might look like.

phymbert · 2024-05-01T15:27:58Z

examples/server/README.md

+- `--yarn-beta-fast N`: YaRN: low correction dim or beta (default: 32.0)
+- `--pooling` : Pooling type for embeddings, use model default if unspecified. Options are `none`, `mean`, `cls`
+- `-dt N`, `--defrag-thold N`: KV cache defragmentation threshold (default: -1.0, < 0 = disabled)
+- `-fa`, `--flash-attn` : enable flash attention (default: disabled).


Should we consider adding this to the hot topic section, @ggerganov ?

probably worth noting that the flash attn support is somewhat experimental and may not work on all platforms.

Also I have been meaning to put together a guide on RoPE / YaRN / grouped attn (self-extend) configuration for different models & length extension factors but haven't gotten to it yet

K-Mistele · 2024-05-05T22:05:30Z

bump

Jeximo

There seems to be a trailing whitespace blocking editor.config

feat: update server README with undocumented options

942e2be

phymbert reviewed May 1, 2024

View reviewed changes

Merge branch 'ggerganov:master' into master

5f4d9db

Jeximo suggested changes May 7, 2024

View reviewed changes

ggerganov approved these changes May 7, 2024

View reviewed changes

ggerganov merged commit 260b7c6 into ggerganov:master May 7, 2024
23 of 24 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update Server's README with undocumented options for RoPE, YaRN, and KV cache quantization #7013

Update Server's README with undocumented options for RoPE, YaRN, and KV cache quantization #7013

K-Mistele commented Apr 30, 2024

phymbert May 1, 2024

K-Mistele May 1, 2024

K-Mistele commented May 5, 2024

Jeximo left a comment

Update Server's README with undocumented options for RoPE, YaRN, and KV cache quantization #7013

Update Server's README with undocumented options for RoPE, YaRN, and KV cache quantization #7013

Conversation

K-Mistele commented Apr 30, 2024

phymbert May 1, 2024

Choose a reason for hiding this comment

K-Mistele May 1, 2024

Choose a reason for hiding this comment

K-Mistele commented May 5, 2024

Jeximo left a comment

Choose a reason for hiding this comment