server: tests: passkey challenge / self-extend with context shift demo (ggerganov#5832)

* server: tests: add models endpoint scenario

* server: /v1/models add some metadata

* server: tests: add debug field in context before scenario

* server: tests: download model from HF, add batch size

* server: tests: add passkey test

* server: tests: add group attention params

* server: do not truncate prompt tokens if self-extend through group attention is enabled

* server: logs: do not truncate log values

* server: tests - passkey - first good working value of nga

* server: tests: fix server timeout

* server: tests: fix passkey, add doc, fix regex content matching, fix timeout

* server: tests: fix regex content matching

* server: tests: schedule slow tests on master

* server: metrics: fix when no prompt processed

* server: tests: self-extend add llama-2-7B and Mixtral-8x7B-v0.1

* server: tests: increase timeout for completion

* server: tests: keep only the PHI-2 test

* server: tests: passkey add a negative test
phymbert authored and jordankanter committed Mar 13, 2024
1 parent acdfeea commit 64c8183
Showing 14 changed files with 362 additions and 111 deletions.
15 changes: 9 additions & 6 deletions .github/workflows/server.yml
@@ -10,6 +10,8 @@ on:
pull_request:
types: [opened, synchronize, reopened]
paths: ['.github/workflows/server.yml', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/tests/**.*']
schedule:
- cron: '00 0 * * *'

jobs:
server:
@@ -70,14 +72,15 @@ jobs:
run: |
pip install -r examples/server/tests/requirements.txt
- name: Download models
id: download_models
- name: Tests
id: server_integration_tests
run: |
cd examples/server/tests
../../../scripts/hf.sh --repo ggml-org/models --file tinyllamas/stories260K.gguf
PORT=8888 ./tests.sh
- name: Tests
id: server_integration_test
- name: Slow tests
id: server_integration_tests_slow
if: github.event.schedule != ''
run: |
cd examples/server/tests
PORT=8888 ./tests.sh
PORT=8888 ./tests.sh --stop --no-skipped --no-capture --tags slow
46 changes: 31 additions & 15 deletions examples/server/server.cpp
@@ -441,8 +441,8 @@ struct llama_server_context
const int ga_w = params.grp_attn_w;

if (ga_n != 1) {
GGML_ASSERT(ga_n > 0 && "ga_n must be positive"); // NOLINT
GGML_ASSERT(ga_w % ga_n == 0 && "ga_w must be a multiple of ga_n"); // NOLINT
GGML_ASSERT(ga_n > 0 && "ga_n must be positive"); // NOLINT
GGML_ASSERT(ga_w % ga_n == 0 && "ga_w must be a multiple of ga_n"); // NOLINT
//GGML_ASSERT(n_ctx_train % ga_w == 0 && "n_ctx_train must be a multiple of ga_w"); // NOLINT
//GGML_ASSERT(n_ctx >= n_ctx_train * ga_n && "n_ctx must be at least n_ctx_train * ga_n"); // NOLINT

@@ -1709,8 +1709,8 @@ struct llama_server_context
}
slot.params.n_keep = std::min(slot.n_ctx - 4, slot.params.n_keep);

// if input prompt is too big, truncate it
if (slot.n_prompt_tokens >= slot.n_ctx)
// if input prompt is too big, truncate it, if group attention self-extend is disabled
if (slot.ga_n == 1 && slot.n_prompt_tokens >= slot.n_ctx)
{
const int n_left = slot.n_ctx - slot.params.n_keep;
const int n_block_size = n_left / 2;
@@ -1785,9 +1785,11 @@ struct llama_server_context
}

LOG_INFO("slot progression", {
{ "slot_id", slot.id },
{ "task_id", slot.task_id },
{ "n_past", slot.n_past },
{ "slot_id", slot.id },
{ "task_id", slot.task_id },
{ "n_past", slot.n_past },
{ "n_past_se", slot.n_past_se },
{ "ga_i", slot.ga_i },
{ "n_prompt_tokens_processed", slot.n_prompt_tokens_processed }
});
}
@@ -2001,6 +2003,17 @@ struct llama_server_context
LOG_VERBOSE("slots updated", {});
return true;
}

json model_meta() {
return json{
{"vocab_type", llama_vocab_type(model)},
{"n_vocab", llama_n_vocab(model)},
{"n_ctx_train", llama_n_ctx_train(model)},
{"n_embd", llama_n_embd(model)},
{"n_params", llama_model_n_params(model)},
{"size", llama_model_size(model)},
};
}
};

static void server_print_usage(const char *argv0, const gpt_params &params,
@@ -2911,9 +2924,10 @@ int main(int argc, char **argv)
for (const auto& metric_def : metrics_def) {
std::string name = metric_def["name"];
std::string help = metric_def["help"];
prometheus << "# HELP llamacpp:" << name << " " << help << "\n"
<< "# TYPE llamacpp:" << name << " " << type << "\n"
<< "llamacpp:" << name << " " << metric_def["value"] << "\n";
auto value = json_value(metric_def, "value", 0);
prometheus << "# HELP llamacpp:" << name << " " << help << "\n"
<< "# TYPE llamacpp:" << name << " " << type << "\n"
<< "llamacpp:" << name << " " << value << "\n";
}
}
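The change above falls back to a default value of `0` via `json_value`, so the `/metrics` output stays well-formed even before any prompt has been processed. As an illustration (not part of this commit), a hypothetical way to inspect that output from Python, assuming a local server on port 8080 started with the metrics endpoint enabled:

```python
# Hypothetical helper, not part of this commit: scrape the Prometheus-style
# /metrics endpoint and print every llamacpp: sample. Assumes the server
# listens on localhost:8080 with the metrics endpoint enabled.
import urllib.request

with urllib.request.urlopen("http://localhost:8080/metrics") as resp:
    body = resp.read().decode("utf-8")

for line in body.splitlines():
    if line.startswith("llamacpp:"):  # skip the "# HELP" / "# TYPE" comment lines
        name, _, value = line.partition(" ")
        print(f"{name} = {value}")
```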

@@ -2994,6 +3008,7 @@ int main(int argc, char **argv)
state.store(SERVER_STATE_READY);
LOG_INFO("model loaded", {});
}
const auto model_meta = llama.model_meta();

if (sparams.chat_template.empty()) { // custom chat template is not supplied
// check if the template comes with the model is supported by us
@@ -3143,7 +3158,7 @@ int main(int argc, char **argv)
}
});

svr.Get("/v1/models", [&params](const httplib::Request& req, httplib::Response& res)
svr.Get("/v1/models", [&params, &model_meta](const httplib::Request& req, httplib::Response& res)
{
res.set_header("Access-Control-Allow-Origin", req.get_header_value("Origin"));
std::time_t t = std::time(0);
@@ -3152,10 +3167,11 @@
{"object", "list"},
{"data", {
{
{"id", params.model_alias},
{"object", "model"},
{"created", t},
{"owned_by", "llamacpp"}
{"id", params.model_alias},
{"object", "model"},
{"created", t},
{"owned_by", "llamacpp"},
{"meta", model_meta}
},
}}
};
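The `/v1/models` handler above now attaches the `meta` object built by `model_meta()` to each model entry. As an illustration (not part of this commit), a minimal sketch of reading it, assuming a server already running on `localhost:8080`:

```python
# Hypothetical client snippet (not from the test suite): fetch /v1/models and
# print the new metadata block. Assumes a server running on localhost:8080.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8080/v1/models") as resp:
    models = json.load(resp)

model = models["data"][0]
print(model["id"], model["owned_by"])
# "meta" carries vocab_type, n_vocab, n_ctx_train, n_embd, n_params and size,
# as populated by model_meta() in the diff above.
for key, value in model["meta"].items():
    print(f"  {key}: {value}")
```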
50 changes: 35 additions & 15 deletions examples/server/tests/README.md
@@ -1,47 +1,67 @@
# Server tests

Python based server tests scenario using [BDD](https://en.wikipedia.org/wiki/Behavior-driven_development) and [behave](https://behave.readthedocs.io/en/latest/):
* [issues.feature](./features/issues.feature) Pending issues scenario
* [parallel.feature](./features/parallel.feature) Scenario involving multi slots and concurrent requests
* [security.feature](./features/security.feature) Security, CORS and API Key
* [server.feature](./features/server.feature) Server base scenario: completion, embedding, tokenization, etc...
Python based server tests scenario using [BDD](https://en.wikipedia.org/wiki/Behavior-driven_development)
and [behave](https://behave.readthedocs.io/en/latest/):

* [issues.feature](./features/issues.feature) Pending issues scenario
* [parallel.feature](./features/parallel.feature) Scenario involving multi slots and concurrent requests
* [security.feature](./features/security.feature) Security, CORS and API Key
* [server.feature](./features/server.feature) Server base scenario: completion, embedding, tokenization, etc...

Tests target GitHub workflows job runners with 4 vCPU.

Requests are using [aiohttp](https://docs.aiohttp.org/en/stable/client_reference.html), [asyncio](https://docs.python.org/fr/3/library/asyncio.html) based http client.
Requests are
using [aiohttp](https://docs.aiohttp.org/en/stable/client_reference.html), [asyncio](https://docs.python.org/fr/3/library/asyncio.html)
based http client.

Note: If the host architecture inference speed is faster than GitHub runners one, parallel scenario may randomly fail. To mitigate it, you can increase values in `n_predict`, `kv_size`.
Note: If the host architecture's inference speed is faster than the GitHub runners', the parallel scenario may fail at random.
To mitigate this, you can increase the `n_predict` and `kv_size` values.
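As an illustration of that request pattern, a minimal sketch (assuming a server already running on `localhost:8080`; this is not the actual `steps.py` code):

```python
# Minimal sketch of the aiohttp/asyncio request pattern used by the step
# definitions (hypothetical snippet, not the actual steps.py code).
# Assumes a server is already running on localhost:8080.
import asyncio

import aiohttp


async def request_completion(prompt: str, n_predict: int = 8) -> dict:
    async with aiohttp.ClientSession() as session:
        async with session.post("http://localhost:8080/completion",
                                json={"prompt": prompt, "n_predict": n_predict}) as resp:
            assert resp.status == 200
            return await resp.json()


if __name__ == "__main__":
    result = asyncio.run(request_completion("I believe the meaning of life is"))
    print(result["content"])
```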

### Install dependencies

`pip install -r requirements.txt`

### Run tests

1. Build the server

```shell
cd ../../..
mkdir build
cd build
cmake ../
cmake --build . --target server
```
2. download required models:
1. `../../../scripts/hf.sh --repo ggml-org/models --file tinyllamas/stories260K.gguf`
3. Start the test: `./tests.sh`

2. Start the test: `./tests.sh`

It's possible to override some scenario steps values with environment variables:
- `PORT` -> `context.server_port` to set the listening port of the server during scenario, default: `8080`
- `LLAMA_SERVER_BIN_PATH` -> to change the server binary path, default: `../../../build/bin/server`
- `DEBUG` -> "ON" to enable steps and server verbose mode `--verbose`
- `SERVER_LOG_FORMAT_JSON` -> if set switch server logs to json format

| variable | description |
|--------------------------|------------------------------------------------------------------------------------------------|
| `PORT` | `context.server_port` to set the listening port of the server during scenario, default: `8080` |
| `LLAMA_SERVER_BIN_PATH` | to change the server binary path, default: `../../../build/bin/server` |
| `DEBUG` | "ON" to enable steps and server verbose mode `--verbose` |
| `SERVER_LOG_FORMAT_JSON` | if set switch server logs to json format |
| `N_GPU_LAYERS` | number of model layers to offload to VRAM `-ngl --n-gpu-layers` |

### Run @bug, @wip or @wrong_usage annotated scenario

Feature or Scenario must be annotated with `@llama.cpp` to be included in the default scope.

- `@bug` annotation aims to link a scenario with a GitHub issue.
- `@wrong_usage` is meant to show user issues that are actually expected behavior
- `@wip` to focus on a scenario that is a work in progress
- `@slow` heavy test, disabled by default

To run a scenario annotated with `@bug`, start:
`DEBUG=ON ./tests.sh --no-skipped --tags bug`

```shell
DEBUG=ON ./tests.sh --no-skipped --tags bug
```

After changing logic in `steps.py`, ensure that the `@bug` and `@wrong_usage` scenarios are updated.

```shell
./tests.sh --no-skipped --tags bug,wrong_usage || echo "should failed but compile"
```
5 changes: 4 additions & 1 deletion examples/server/tests/features/environment.py
@@ -7,7 +7,10 @@


def before_scenario(context, scenario):
print(f"\x1b[33;42mStarting new scenario: {scenario.name}!\x1b[0m")
context.debug = 'DEBUG' in os.environ and os.environ['DEBUG'] == 'ON'
if context.debug:
print("DEBUG=ON\n")
print(f"\x1b[33;42mStarting new scenario: {scenario.name}!\x1b[0m\n")
port = 8080
if 'PORT' in os.environ:
port = int(os.environ['PORT'])
1 change: 1 addition & 0 deletions examples/server/tests/features/issues.feature
@@ -1,4 +1,5 @@
# List of ongoing issues
# run with: DEBUG=ON ./tests.sh --no-skipped --tags bug
@bug
Feature: Issues
# No confirmed issue at the moment
5 changes: 3 additions & 2 deletions examples/server/tests/features/parallel.feature
@@ -1,11 +1,12 @@
@llama.cpp
@parallel
Feature: Parallel

Background: Server startup
Given a server listening on localhost:8080
And a model file stories260K.gguf
And a model alias tinyllama-2
And a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
And 42 as server seed
And 512 as batch size
And 64 KV cache size
And 2 slots
And embeddings extraction
55 changes: 55 additions & 0 deletions examples/server/tests/features/passkey.feature
@@ -0,0 +1,55 @@
# run with: ./tests.sh --no-skipped --tags passkey
@passkey
@slow
Feature: Passkey / Self-extend with context shift

Background: Server startup
Given a server listening on localhost:8080

# Generates a long text of junk and inserts a secret passkey number inside it.
# Then we query the LLM for the secret passkey.
# see #3856 and #4810
Scenario Outline: Passkey
Given a model file <hf_file> from HF repo <hf_repo>
And <n_batch> as batch size
And <n_junk> as number of junk
And <n_predicted> server max tokens to predict
And 42 as seed
And <n_ctx> KV cache size
And 1 slots
And <n_ga> group attention factor to extend context size through self-extend
And <n_ga_w> group attention width to extend context size through self-extend
# Can be overridden with N_GPU_LAYERS
And <ngl> GPU offloaded layers
Then the server is starting
Then the server is healthy
Given available models
Then model 0 is trained on <n_ctx_train> tokens context
Given a prefix prompt:
"""
here is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.
"""
And a passkey prompt template:
"""
The pass key is <passkey> Remember it. <passkey> is the pass key.
"""
And a junk suffix prompt:
"""
The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again.
"""
And a suffix prompt:
"""
What is the pass key? The pass key is
"""
Given a "<passkey>" passkey challenge prompt with the passkey inserted every <i_pos> junk
And a completion request with no api error
Then <n_predicted> tokens are predicted matching <re_content>

Examples:
| hf_repo | hf_file | n_ctx_train | ngl | n_ctx | n_batch | n_ga | n_ga_w | n_junk | i_pos | passkey | n_predicted | re_content |
| TheBloke/phi-2-GGUF | phi-2.Q4_K_M.gguf | 2048 | 5 | 8192 | 512 | 4 | 512 | 250 | 50 | 42 | 1 | 42 |
| TheBloke/phi-2-GGUF | phi-2.Q4_K_M.gguf | 2048 | 5 | 8192 | 512 | 2 | 512 | 250 | 50 | 42 | 1 | \b((?!42)\w)+\b |
#| TheBloke/Llama-2-7B-GGUF | llama-2-7b.Q2_K.gguf | 4096 | 3 | 16384 | 512 | 4 | 512 | 500 | 300 | 1234 | 5 | 1234 |
#| TheBloke/Mixtral-8x7B-v0.1-GGUF | mixtral-8x7b-v0.1.Q2_K.gguf | 32768 | 2 | 16384 | 512 | 4 | 512 | 500 | 100 | 0987 | 5 | 0987 |
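The scenario above assembles its prompt from the prefix, the passkey template, the junk sentence and the suffix; a rough sketch of how such a prompt can be built (hypothetical helper, not the actual step implementation — in particular, inserting the passkey once at junk position `i_pos` is an assumption about the step's behavior):

```python
# Hypothetical sketch of assembling the passkey challenge prompt; the real
# logic lives in the behave step definitions. Parameter names mirror the
# scenario placeholders above.
def build_passkey_prompt(prefix: str, passkey_template: str, junk: str,
                         suffix: str, passkey: str, n_junk: int, i_pos: int) -> str:
    parts = [prefix]
    for i in range(n_junk):
        if i == i_pos:
            # Assumption: the passkey sentence is inserted once, at position i_pos.
            parts.append(passkey_template.replace("<passkey>", passkey))
        parts.append(junk)
    parts.append(suffix)
    return " ".join(parts)


# Values taken from the first PHI-2 example row above.
prompt = build_passkey_prompt(
    prefix="here is an important info hidden inside a lot of irrelevant text. "
           "Find it and memorize them. I will quiz you about the important information there.",
    passkey_template="The pass key is <passkey> Remember it. <passkey> is the pass key.",
    junk="The grass is green. The sky is blue. The sun is yellow. "
         "Here we go. There and back again.",
    suffix="What is the pass key? The pass key is",
    passkey="42", n_junk=250, i_pos=50,
)
print(len(prompt.split()), "words in the challenge prompt")
```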

3 changes: 2 additions & 1 deletion examples/server/tests/features/security.feature
@@ -1,9 +1,10 @@
@llama.cpp
@security
Feature: Security

Background: Server startup with an api key defined
Given a server listening on localhost:8080
And a model file stories260K.gguf
And a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
And a server api key llama.cpp
Then the server is starting
Then the server is healthy
23 changes: 15 additions & 8 deletions examples/server/tests/features/server.feature
@@ -1,15 +1,17 @@
@llama.cpp
@server
Feature: llama.cpp server

Background: Server startup
Given a server listening on localhost:8080
And a model file stories260K.gguf
And a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
And a model alias tinyllama-2
And 42 as server seed
# KV Cache corresponds to the total amount of tokens
# that can be stored across all independent sequences: #4130
# see --ctx-size and #5568
And 32 KV cache size
And 512 as batch size
And 1 slots
And embeddings extraction
And 32 server max tokens to predict
@@ -29,9 +31,9 @@ Feature: llama.cpp server
And prometheus metrics are exposed

Examples: Prompts
| prompt | n_predict | re_content | n_predicted |
| I believe the meaning of life is | 8 | (read<or>going)+ | 8 |
| Write a joke about AI | 64 | (park<or>friends<or>scared<or>always)+ | 32 |
| prompt | n_predict | re_content | n_predicted |
| I believe the meaning of life is | 8 | (read\|going)+ | 8 |
| Write a joke about AI | 64 | (park\|friends\|scared\|always)+ | 32 |

Scenario Outline: OAI Compatibility
Given a model <model>
@@ -43,9 +45,9 @@ Feature: llama.cpp server
Then <n_predicted> tokens are predicted matching <re_content>

Examples: Prompts
| model | system_prompt | user_prompt | max_tokens | re_content | n_predicted | enable_streaming |
| llama-2 | Book | What is the best book | 8 | (Mom<or>what)+ | 8 | disabled |
| codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 64 | (thanks<or>happy<or>bird)+ | 32 | enabled |
| model | system_prompt | user_prompt | max_tokens | re_content | n_predicted | enable_streaming |
| llama-2 | Book | What is the best book | 8 | (Mom\|what)+ | 8 | disabled |
| codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 64 | (thanks\|happy\|bird)+ | 32 | enabled |

Scenario: Embedding
When embeddings are computed for:
@@ -75,10 +77,15 @@ Feature: llama.cpp server
When an OAI compatible embeddings computation request for multiple inputs
Then embeddings are generated


Scenario: Tokenize / Detokenize
When tokenizing:
"""
What is the capital of France ?
"""
Then tokens can be detokenize

Scenario: Models available
Given available models
Then 1 models are supported
Then model 0 is identified by tinyllama-2
Then model 0 is trained on 128 tokens context
