forked from ggerganov/llama.cpp
server: tests: passkey challenge / self-extend with context shift demo (ggerganov#5832)

* server: tests: add models endpoint scenario
* server: /v1/models add some metadata
* server: tests: add debug field in context before scenario
* server: tests: download model from HF, add batch size
* server: tests: add passkey test
* server: tests: add group attention params
* server: do not truncate prompt tokens if self-extend through group attention is enabled
* server: logs: do not truncate log values
* server: tests - passkey - first good working value of nga
* server: tests: fix server timeout
* server: tests: fix passkey, add doc, fix regex content matching, fix timeout
* server: tests: fix regex content matching
* server: tests: schedule slow tests on master
* server: metrics: fix when no prompt processed
* server: tests: self-extend add llama-2-7B and Mixtral-8x7B-v0.1
* server: tests: increase timeout for completion
* server: tests: keep only the PHI-2 test
* server: tests: passkey add a negative test
1 parent acdfeea, commit 64c8183
Showing 14 changed files with 362 additions and 111 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,47 +1,67 @@ | ||
# Server tests | ||
|
||
Python based server tests scenario using [BDD](https://en.wikipedia.org/wiki/Behavior-driven_development) and [behave](https://behave.readthedocs.io/en/latest/): | ||
* [issues.feature](./features/issues.feature) Pending issues scenario | ||
* [parallel.feature](./features/parallel.feature) Scenario involving multi slots and concurrent requests | ||
* [security.feature](./features/security.feature) Security, CORS and API Key | ||
* [server.feature](./features/server.feature) Server base scenario: completion, embedding, tokenization, etc... | ||
Python based server tests scenario using [BDD](https://en.wikipedia.org/wiki/Behavior-driven_development) | ||
and [behave](https://behave.readthedocs.io/en/latest/): | ||
|
||
* [issues.feature](./features/issues.feature) Pending issues scenario | ||
* [parallel.feature](./features/parallel.feature) Scenario involving multi slots and concurrent requests | ||
* [security.feature](./features/security.feature) Security, CORS and API Key | ||
* [server.feature](./features/server.feature) Server base scenario: completion, embedding, tokenization, etc... | ||
|
||
Tests target GitHub workflows job runners with 4 vCPU. | ||
|
||
Requests are using [aiohttp](https://docs.aiohttp.org/en/stable/client_reference.html), [asyncio](https://docs.python.org/fr/3/library/asyncio.html) based http client. | ||
Requests are | ||
using [aiohttp](https://docs.aiohttp.org/en/stable/client_reference.html), [asyncio](https://docs.python.org/fr/3/library/asyncio.html) | ||
based http client. | ||
|
||
Note: If the host architecture inference speed is faster than GitHub runners one, parallel scenario may randomly fail. To mitigate it, you can increase values in `n_predict`, `kv_size`. | ||
Note: If the host architecture inference speed is faster than GitHub runners one, parallel scenario may randomly fail. | ||
To mitigate it, you can increase values in `n_predict`, `kv_size`. | ||
|
||
### Install dependencies | ||
|
||
`pip install -r requirements.txt` | ||
|
||
### Run tests | ||
|
||
1. Build the server | ||
|
||
```shell | ||
cd ../../.. | ||
mkdir build | ||
cd build | ||
cmake ../ | ||
cmake --build . --target server | ||
``` | ||
2. download required models: | ||
1. `../../../scripts/hf.sh --repo ggml-org/models --file tinyllamas/stories260K.gguf` | ||
3. Start the test: `./tests.sh` | ||
|
||
2. Start the test: `./tests.sh` | ||
|
||
It's possible to override some scenario steps values with environment variables: | ||
- `PORT` -> `context.server_port` to set the listening port of the server during scenario, default: `8080` | ||
- `LLAMA_SERVER_BIN_PATH` -> to change the server binary path, default: `../../../build/bin/server` | ||
- `DEBUG` -> "ON" to enable steps and server verbose mode `--verbose` | ||
- `SERVER_LOG_FORMAT_JSON` -> if set switch server logs to json format | ||
|
||
| variable | description | | ||
|--------------------------|------------------------------------------------------------------------------------------------| | ||
| `PORT` | `context.server_port` to set the listening port of the server during scenario, default: `8080` | | ||
| `LLAMA_SERVER_BIN_PATH` | to change the server binary path, default: `../../../build/bin/server` | | ||
| `DEBUG` | "ON" to enable steps and server verbose mode `--verbose` | | ||
| `SERVER_LOG_FORMAT_JSON` | if set switch server logs to json format | | ||
| `N_GPU_LAYERS` | number of model layers to offload to VRAM `-ngl --n-gpu-layers` | | ||
|
||
### Run @bug, @wip or @wrong_usage annotated scenario | ||
|
||
Feature or Scenario must be annotated with `@llama.cpp` to be included in the default scope. | ||
|
||
- `@bug` annotation aims to link a scenario with a GitHub issue. | ||
- `@wrong_usage` are meant to show user issue that are actually an expected behavior | ||
- `@wip` to focus on a scenario working in progress | ||
- `@slow` heavy test, disabled by default | ||
|
||
To run a scenario annotated with `@bug`, start: | ||
`DEBUG=ON ./tests.sh --no-skipped --tags bug` | ||
|
||
```shell | ||
DEBUG=ON ./tests.sh --no-skipped --tags bug | ||
``` | ||
|
||
After changing logic in `steps.py`, ensure that `@bug` and `@wrong_usage` scenario are updated. | ||
|
||
```shell | ||
./tests.sh --no-skipped --tags bug,wrong_usage || echo "should failed but compile" | ||
``` |
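The environment-variable overrides listed in the README table above could be read along these lines. This is only a minimal sketch: the function name `read_overrides` and the dictionary keys are hypothetical, not taken from the commit's `steps.py`; only the variable names and defaults come from the table.

```python
import os


def read_overrides(env=os.environ):
    """Collect test settings from environment variables (hypothetical helper).

    Variable names and defaults follow the README table; the returned keys
    are illustrative and are not the names used by the real test steps.
    """
    n_gpu_layers = env.get("N_GPU_LAYERS")
    return {
        # Listening port of the server during a scenario, default 8080.
        "server_port": int(env.get("PORT", "8080")),
        # Path to the server binary.
        "server_bin_path": env.get("LLAMA_SERVER_BIN_PATH", "../../../build/bin/server"),
        # DEBUG must be exactly "ON" to enable verbose mode.
        "debug": env.get("DEBUG") == "ON",
        # Merely being set switches the server logs to JSON.
        "log_format_json": "SERVER_LOG_FORMAT_JSON" in env,
        # None means "keep the value given in the scenario".
        "n_gpu_layers": int(n_gpu_layers) if n_gpu_layers is not None else None,
    }


if __name__ == "__main__":
    print(read_overrides())
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise without touching the real environment.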
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,5 @@ | ||
# List of ongoing issues | ||
# run with: DEBUG=ON ./tests.sh --no-skipped --tags bug | ||
@bug | ||
Feature: Issues | ||
# No confirmed issue at the moment |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
# run with: ./tests.sh --no-skipped --tags passkey | ||
@passkey | ||
@slow | ||
Feature: Passkey / Self-extend with context shift | ||
|
||
Background: Server startup | ||
Given a server listening on localhost:8080 | ||
|
||
# Generates a long text of junk and inserts a secret passkey number inside it. | ||
# Then we query the LLM for the secret passkey. | ||
# see #3856 and #4810 | ||
Scenario Outline: Passkey | ||
Given a model file <hf_file> from HF repo <hf_repo> | ||
And <n_batch> as batch size | ||
And <n_junk> as number of junk | ||
And <n_predicted> server max tokens to predict | ||
And 42 as seed | ||
And <n_ctx> KV cache size | ||
And 1 slots | ||
And <n_ga> group attention factor to extend context size through self-extend | ||
And <n_ga_w> group attention width to extend context size through self-extend | ||
# Can be override with N_GPU_LAYERS | ||
And <ngl> GPU offloaded layers | ||
Then the server is starting | ||
Then the server is healthy | ||
Given available models | ||
Then model 0 is trained on <n_ctx_train> tokens context | ||
Given a prefix prompt: | ||
""" | ||
here is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there. | ||
""" | ||
And a passkey prompt template: | ||
""" | ||
The pass key is <passkey> Remember it. <passkey> is the pass key. | ||
""" | ||
And a junk suffix prompt: | ||
""" | ||
The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. | ||
""" | ||
And a suffix prompt: | ||
""" | ||
What is the pass key? The pass key is | ||
""" | ||
Given a "<passkey>" passkey challenge prompt with the passkey inserted every <i_pos> junk | ||
And a completion request with no api error | ||
Then <n_predicted> tokens are predicted matching <re_content> | ||
|
||
Examples: | ||
| hf_repo | hf_file | n_ctx_train | ngl | n_ctx | n_batch | n_ga | n_ga_w | n_junk | i_pos | passkey | n_predicted | re_content | | ||
| TheBloke/phi-2-GGUF | phi-2.Q4_K_M.gguf | 2048 | 5 | 8192 | 512 | 4 | 512 | 250 | 50 | 42 | 1 | 42 | | ||
| TheBloke/phi-2-GGUF | phi-2.Q4_K_M.gguf | 2048 | 5 | 8192 | 512 | 2 | 512 | 250 | 50 | 42 | 1 | \b((?!42)\w)+\b | | ||
#| TheBloke/Llama-2-7B-GGUF | llama-2-7b.Q2_K.gguf | 4096 | 3 | 16384 | 512 | 4 | 512 | 500 | 300 | 1234 | 5 | 1234 | | ||
#| TheBloke/Mixtral-8x7B-v0.1-GGUF | mixtral-8x7b-v0.1.Q2_K.gguf | 32768 | 2 | 16384 | 512 | 4 | 512 | 500 | 100 | 0987 | 5 | 0 | ||
# 987 | | ||
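The passkey scenario above assembles a long prompt from a prefix, repeated junk text, a passkey sentence and a final question, then checks the completion against `re_content`. The sketch below illustrates that idea under stated assumptions: the helper names `build_passkey_prompt` and `content_matches` are hypothetical, the passkey sentence is inserted once at junk index `i_pos` (the step wording "every `<i_pos>` junk" could also be read as a repeated insertion), and matching uses a plain `re.search`. It is not the code from the commit's `steps.py`.

```python
import re

PREFIX = ("here is an important info hidden inside a lot of irrelevant text. "
          "Find it and memorize them. I will quiz you about the important information there.")
PASSKEY_TEMPLATE = "The pass key is <passkey> Remember it. <passkey> is the pass key."
JUNK = ("The grass is green. The sky is blue. The sun is yellow. "
        "Here we go. There and back again.")
SUFFIX = "What is the pass key? The pass key is"


def build_passkey_prompt(passkey, n_junk, i_pos):
    """Build the challenge prompt (assumption: one insertion at index i_pos)."""
    passkey_sentence = PASSKEY_TEMPLATE.replace("<passkey>", passkey)
    parts = [PREFIX]
    for i in range(n_junk):
        if i == i_pos:
            parts.append(passkey_sentence)
        parts.append(JUNK)
    parts.append(SUFFIX)
    return " ".join(parts)


def content_matches(re_content, content):
    """Return True if the completion text contains a match for re_content."""
    return re.search(re_content, content) is not None


if __name__ == "__main__":
    prompt = build_passkey_prompt("42", n_junk=250, i_pos=50)
    print(len(prompt.split()), "whitespace-separated words in the prompt")
    # Positive row: the completion is expected to contain the passkey.
    print(content_matches(r"42", " 42"))
    # Negative row: the pattern only matches a word that never contains "42".
    print(content_matches(r"\b((?!42)\w)+\b", " 42"))
```

The negative pattern `\b((?!42)\w)+\b` consumes word characters one at a time and refuses to step onto any position where `42` begins, so a completion consisting only of the passkey yields no match.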
|