server: init functional tests #5566

Merged: 100 commits, Feb 24, 2024

Changes from 1 commit

Commits (100)
157bcf2
server: init functional test
phymbert Feb 18, 2024
9b63d70
server: tests: reduce number of files, all in one tests shell script
phymbert Feb 19, 2024
6497755
server: tests: fix ci workflow
phymbert Feb 19, 2024
4e5245e
server: tests: fix ci workflow
phymbert Feb 19, 2024
30aa323
server: tests: fix ci workflow
phymbert Feb 19, 2024
fe9866a
server: tests: use ngxson llama_xs_q4.bin
phymbert Feb 19, 2024
1680599
server: tests: build only the server
phymbert Feb 19, 2024
8bb586b
server: tests: add health check and concurrent request example
phymbert Feb 20, 2024
6c95ec6
server: tests: change model to: @karpathy's tinyllamas
phymbert Feb 20, 2024
56583be
server: tests: refactor steps and vocabulary
phymbert Feb 20, 2024
9b7ea97
server: tests: add OAI stream test, fix file end of line, fast fail b…
phymbert Feb 20, 2024
11adf1d
server: tests: add OAI multi user scenario
phymbert Feb 20, 2024
c355f76
server: tests: slots endpoint checks
phymbert Feb 20, 2024
367b59a
server: tests: check for infinite loops
phymbert Feb 20, 2024
b9f8390
server: tests: check for infinite loops
phymbert Feb 20, 2024
0772884
server: tests: add a constant seed in completion request
phymbert Feb 20, 2024
6b9dc4f
server: tests: add infinite loop
phymbert Feb 20, 2024
68574c6
server: tests: add infinite loop scenario
phymbert Feb 20, 2024
b0b6d83
server: tests: add infinite loop scenario
phymbert Feb 20, 2024
1ecda0d
server: tests: disable issue 3969 scenario
phymbert Feb 20, 2024
e6d4820
server: tests: add embeddings scenario
phymbert Feb 20, 2024
1065f6d
server: tests: add tokenize/detokenize scenario
phymbert Feb 20, 2024
19664b9
server: tests: detokenize endpoint issue reference added
phymbert Feb 20, 2024
6dcbcfe
server: tests: simplify completion scenario
phymbert Feb 20, 2024
672d98f
server: tests: CORS and api key checks scenario
phymbert Feb 21, 2024
3322bfa
server: tests: add a small check to be sure all started threads have …
phymbert Feb 21, 2024
469af4b
server: tests: change CI workflow trigger
phymbert Feb 21, 2024
2a37bd6
server: tests: fix the multi users infinite loop test
phymbert Feb 21, 2024
f1d4138
server : fix initialization thread issues
ggerganov Feb 21, 2024
600cbeb
server: test: ci change the GitHub workflow trigger
phymbert Feb 21, 2024
68b8d4e
Merge remote-tracking branch 'origin/master' into test/server-add-ci-…
phymbert Feb 21, 2024
6406208
server: tests:
phymbert Feb 21, 2024
01cca66
server: tests: ci fix model download path
phymbert Feb 21, 2024
534998d
server: tests: ci tests.sh exit code
phymbert Feb 21, 2024
a697cd1
minor : fix missing new line
ggerganov Feb 22, 2024
41676d9
ci : actually no reason to exclude GPU code from triggers
ggerganov Feb 22, 2024
016b221
server: fix health/slots endpoint slot state access available race co…
phymbert Feb 22, 2024
e43406e
server: tests: switch to asyncio for concurrent tests, match result c…
phymbert Feb 22, 2024
597c181
server: tests: ci do not take a model anymore, fix trigger patch
phymbert Feb 22, 2024
8b96bda
Merge remote-tracking branch 'origin/master' into test/server-add-ci-…
phymbert Feb 22, 2024
f820e10
server: tests: ci ensure the server is stopped before scenario, and d…
phymbert Feb 22, 2024
aa591ef
server: tests: add Multi users with total number of tokens to predict…
phymbert Feb 22, 2024
26b66c5
server: tests: Fix some random behavior where the wait for busy statu…
phymbert Feb 22, 2024
51f5274
server: tests: ci triggered on any changes on server example path
phymbert Feb 22, 2024
cba6d4e
server: tests: minor fix missing param.
phymbert Feb 22, 2024
1bd07e5
server: tests: assert embeddings are actually computed, make the embe…
phymbert Feb 23, 2024
14b6ede
server: tests: minor color change
phymbert Feb 23, 2024
b38b9e6
server: tests: minor fix server --alias param passed twice
phymbert Feb 23, 2024
70e9055
server: tests: add log in server start to identify why the server doe…
phymbert Feb 23, 2024
2f756f8
server: tests: allow to override the server port before launching tests
phymbert Feb 23, 2024
6a215e5
server: tests: ci adding container to specify server port and allow t…
phymbert Feb 23, 2024
2bb4732
server: tests: ci adding cmake as it is not present by default in ubu…
phymbert Feb 23, 2024
d0e0050
server: tests: ci adding python3-pip as it is not present by default …
phymbert Feb 23, 2024
6e71126
server: tests: ci adding curl as it is not present by default in ubun…
phymbert Feb 23, 2024
6bba3be
server: tests: ci adding psmisc as it is not present by default in ub…
phymbert Feb 23, 2024
5110de0
server: tests: fix coloring console
phymbert Feb 23, 2024
bedf37c
server: tests: reducing n_ctx and n_predict for // prompts as it is t…
phymbert Feb 23, 2024
530d3ae
server: tests: reducing sleep time during scenario
phymbert Feb 23, 2024
36ddb96
server: tests: parallel fix server is started twice, add colors to he…
phymbert Feb 23, 2024
0b0f056
server: tests: ci : build and run tests for all matrix defines, sanit…
phymbert Feb 23, 2024
29f8833
server: tests: ci : fix wget missing
phymbert Feb 23, 2024
12bb797
server: tests: ci : add git
phymbert Feb 23, 2024
68cd1a4
server: tests: ci : matrix cuda
phymbert Feb 23, 2024
86896aa
server: tests: ci : continue on error
phymbert Feb 23, 2024
334902b
server: tests: ci : fix step id duplicated
phymbert Feb 23, 2024
fce2e00
server: tests: ci : fix cuda install
phymbert Feb 23, 2024
e4fb790
server: test: ci fix cuda build
phymbert Feb 23, 2024
2edd995
server: test: ci fix cublas build
phymbert Feb 23, 2024
fa51bac
server: test: ci fix matrix
phymbert Feb 23, 2024
606738e
server: test: ci fix clblast
phymbert Feb 23, 2024
d159e29
server: test: ci fix openblas build
phymbert Feb 23, 2024
13863ef
server: test: ci matrix
phymbert Feb 23, 2024
4d3791a
server: test: ci matrix, experimental on matrix avx512 entry which fa…
phymbert Feb 23, 2024
b94809b
server: test: ci cmake remove all warning as it is done by the classi…
phymbert Feb 23, 2024
5a621e7
server: test: ci make arch not available pass the test
phymbert Feb 23, 2024
54ea4d4
server: test: ax512 experimental
phymbert Feb 23, 2024
5b2ce45
server: test: display server logs in case of failure
phymbert Feb 23, 2024
6dc3af5
server: test: fix CUDA LD PATH
phymbert Feb 23, 2024
83c386f
server: test: ci debug LD path
phymbert Feb 23, 2024
0d380ae
server: test: ci debug CI LD path
phymbert Feb 23, 2024
c75e0e1
server: test: ci switch to nvidia based docker image for cuda
phymbert Feb 23, 2024
2c8bf24
server: test: ci give up with nvidia as it requires the nvidia docker…
phymbert Feb 23, 2024
777bdcf
server: test: ci rename step name to Test, change matrix order for be…
phymbert Feb 23, 2024
e10b83a
server: test: ci rename job name to Server
phymbert Feb 23, 2024
4d27466
server: tests: move all requests call to asyncio
phymbert Feb 23, 2024
1c1fd40
server: tests: allow to pass argument to the test file
phymbert Feb 23, 2024
2109743
server: tests: print server logs only on github action
phymbert Feb 23, 2024
30f802d
server: tests: check if the server has not crashed after a scenario
phymbert Feb 23, 2024
6c0e6f4
server: tests: adding concurrent embedding in issue #5655
phymbert Feb 23, 2024
77b8589
server: tests: linter
phymbert Feb 23, 2024
7183149
server: tests: fix concurrent OAI streaming request
phymbert Feb 23, 2024
2d107ba
server: tests: add a note regarding inference speed.
phymbert Feb 23, 2024
124ca77
server: tests: removing debug print
phymbert Feb 24, 2024
5957a2d
server: tests - allow print on debug
phymbert Feb 24, 2024
482eb30
server: tests - README.md add build instruction and notice on @bug an…
phymbert Feb 24, 2024
60781f0
server: tests - add explanation about KV Cache.
phymbert Feb 24, 2024
a779a4b
server: tests - print only in case of DEBUG
phymbert Feb 24, 2024
a2a928c
server: add link to tests in the README.md
phymbert Feb 24, 2024
5ed4452
server: tests: improved README.md
phymbert Feb 24, 2024
99163c8
github issue template: add link to the tests server framework
phymbert Feb 24, 2024
server: tests: CORS and api key checks scenario
phymbert committed Feb 21, 2024
commit 672d98f6f0acee9f93bf74e44a032eee5942ff5a
40 changes: 30 additions & 10 deletions examples/server/tests/features/server.feature
@@ -1,7 +1,7 @@
Feature: llama.cpp server

Background: Server startup
Given a server listening on localhost:8080 with 2 slots and 42 as seed
Given a server listening on localhost:8080 with 2 slots, 42 as seed and llama.cpp as api key
Then the server is starting
Then the server is healthy

@@ -13,13 +13,17 @@ Feature: llama.cpp server

@llama.cpp
Scenario Outline: Completion
Given a <prompt> completion request with maximum <n_predict> tokens
Given a prompt <prompt>
And a user api key <api_key>
And <n_predict> max tokens to predict
And a completion request
Then <n_predict> tokens are predicted

Examples: Prompts
| prompt | n_predict |
| I believe the meaning of life is | 128 |
| Write a joke about AI | 512 |
| prompt | n_predict | api_key |
| I believe the meaning of life is | 128 | llama.cpp |
| Write a joke about AI | 512 | llama.cpp |
| say goodbye | 0 | |

@llama.cpp
Scenario Outline: OAI Compatibility
@@ -28,13 +32,15 @@ Feature: llama.cpp server
And a model <model>
And <max_tokens> max tokens to predict
And streaming is <enable_streaming>
Given an OAI compatible chat completions request
And a user api key <api_key>
Given an OAI compatible chat completions request with an api error <api_error>
Then <max_tokens> tokens are predicted

Examples: Prompts
| model | system_prompt | user_prompt | max_tokens | enable_streaming |
| llama-2 | You are ChatGPT. | Say hello. | 64 | false |
| codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 512 | true |
| model | system_prompt | user_prompt | max_tokens | enable_streaming | api_key | api_error |
| llama-2 | You are ChatGPT. | Say hello. | 64 | false | llama.cpp | none |
| codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 512 | true | llama.cpp | none |
| John-Doe | You are an hacker. | Write segfault code in rust. | 0 | true | hackme | raised |

@llama.cpp
Scenario: Multi users
@@ -47,6 +53,7 @@ Feature: llama.cpp server
Write another very long music lyrics.
"""
And 32 max tokens to predict
And a user api key llama.cpp
Given concurrent completion requests
Then the server is busy
And all slots are busy
@@ -57,7 +64,7 @@ Feature: llama.cpp server
@llama.cpp
Scenario: Multi users OAI Compatibility
Given a system prompt "You are an AI assistant."
And a model tinyllama-2
And a model tinyllama-2
Given a prompt:
"""
Write a very long story about AI.
@@ -68,6 +75,7 @@ Feature: llama.cpp server
"""
And 32 max tokens to predict
And streaming is enabled
And a user api key llama.cpp
Given concurrent OAI completions requests
Then the server is busy
And all slots are busy
@@ -126,3 +134,15 @@ Feature: llama.cpp server
"""
Then tokens can be detokenize

@llama.cpp
Scenario Outline: CORS Options
When an OPTIONS request is sent from <origin>
Then CORS header <cors_header> is set to <cors_header_value>

Examples: Headers
| origin | cors_header | cors_header_value |
| localhost | Access-Control-Allow-Origin | localhost |
| web.mydomain.fr | Access-Control-Allow-Origin | web.mydomain.fr |
| origin | Access-Control-Allow-Credentials | true |
| web.mydomain.fr | Access-Control-Allow-Methods | POST |
| web.mydomain.fr | Access-Control-Allow-Headers | * |
135 changes: 94 additions & 41 deletions examples/server/tests/features/steps/steps.py
@@ -7,8 +7,9 @@
from behave import step


@step(u"a server listening on {server_fqdn}:{server_port} with {n_slots} slots and {seed} as seed")
def step_server_config(context, server_fqdn, server_port, n_slots, seed):
@step(
u"a server listening on {server_fqdn}:{server_port} with {n_slots} slots, {seed} as seed and {api_key} as api key")
def step_server_config(context, server_fqdn, server_port, n_slots, seed, api_key):
context.server_fqdn = server_fqdn
context.server_port = int(server_port)
context.n_slots = int(n_slots)
@@ -19,7 +20,8 @@ def step_server_config(context, server_fqdn, server_port, n_slots, seed):
context.completion_threads = []
context.prompts = []

openai.api_key = 'llama.cpp'
context.api_key = api_key
openai.api_key = context.api_key


@step(u"the server is {expecting_status}")
@@ -77,14 +79,16 @@ def step_all_slots_status(context, expected_slot_status_string):
request_slots_status(context, expected_slots)


@step(u'a {prompt} completion request with maximum {n_predict} tokens')
def step_request_completion(context, prompt, n_predict):
request_completion(context, prompt, n_predict)
@step(u'a completion request')
def step_request_completion(context):
request_completion(context, context.prompts.pop(), context.n_predict, context.user_api_key)
context.user_api_key = None


@step(u'{predicted_n} tokens are predicted')
def step_n_tokens_predicted(context, predicted_n):
assert_n_tokens_predicted(context.completions[0], int(predicted_n))
if int(predicted_n) > 0:
assert_n_tokens_predicted(context.completions[0], int(predicted_n))


@step(u'a user prompt {user_prompt}')
@@ -112,24 +116,40 @@ def step_streaming(context, enable_streaming):
context.enable_streaming = enable_streaming == 'enabled' or bool(enable_streaming)


@step(u'an OAI compatible chat completions request')
def step_oai_chat_completions(context):
oai_chat_completions(context, context.user_prompt)
@step(u'a user api key {user_api_key}')
def step_user_api_key(context, user_api_key):
context.user_api_key = user_api_key


@step(u'a user api key ')
def step_user_api_key(context):
context.user_api_key = None


@step(u'an OAI compatible chat completions request with an api error {api_error}')
def step_oai_chat_completions(context, api_error):
oai_chat_completions(context, context.user_prompt, api_error=api_error == 'raised')
context.user_api_key = None


@step(u'a prompt')
def step_a_prompt(context):
context.prompts.append(context.text)


@step(u'a prompt {prompt}')
def step_a_prompt_prompt(context, prompt):
context.prompts.append(prompt)


@step(u'concurrent completion requests')
def step_concurrent_completion_requests(context):
concurrent_requests(context, request_completion)
concurrent_requests(context, request_completion, context.n_predict, context.user_api_key)


@step(u'concurrent OAI completions requests')
def step_oai_chat_completions(context):
concurrent_requests(context, oai_chat_completions)
concurrent_requests(context, oai_chat_completions, context.user_api_key)


@step(u'all prompts are predicted')
@@ -168,7 +188,7 @@ def step_oai_compute_embedding(context):
def step_tokenize(context):
context.tokenized_text = context.text
response = requests.post(f'{context.base_url}/tokenize', json={
"content":context.tokenized_text,
"content": context.tokenized_text,
})
assert response.status_code == 200
context.tokens = response.json()['tokens']
@@ -181,49 +201,82 @@ def step_detokenize(context):
"tokens": context.tokens,
})
assert response.status_code == 200
print(response.json())
# FIXME the detokenize answer contains a space prefix ? see #3287
assert context.tokenized_text == response.json()['content'].strip()
phymbert (Collaborator, PR author) commented:

@BruceMacD @ggerganov Is it normal that the /detokenize endpoint prefixes the content with a space? See #3287.

ggerganov (Owner) replied on Feb 21, 2024:

Yes, it is expected - the SPM tokenizer adds a whitespace prefix: google/sentencepiece#15

The test should expect both options - with or without the whitespace prefix.
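
A tolerant assertion, as a minimal sketch reusing the context and response objects from step_detokenize above, could accept both forms:

# Sketch only: accept the detokenized content with or without the
# SPM whitespace prefix discussed above.
detokenized = response.json()['content']
assert detokenized == context.tokenized_text or detokenized.lstrip() == context.tokenized_text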



def concurrent_requests(context, f_completion):
@step(u'an OPTIONS request is sent from {origin}')
def step_options_request(context, origin):
options_response = requests.options(f'{context.base_url}/v1/chat/completions',
headers={"Origin": origin})
assert options_response.status_code == 200
context.options_response = options_response


@step(u'CORS header {cors_header} is set to {cors_header_value}')
def step_check_options_header_value(context, cors_header, cors_header_value):
assert context.options_response.headers[cors_header] == cors_header_value


def concurrent_requests(context, f_completion, *argv):
context.completions.clear()
context.completion_threads.clear()
for prompt in context.prompts:
completion_thread = threading.Thread(target=f_completion, args=(context, prompt))
completion_thread = threading.Thread(target=f_completion, args=(context, prompt, *argv))
completion_thread.start()
context.completion_threads.append(completion_thread)
context.prompts.clear()


def request_completion(context, prompt, n_predict=None):
response = requests.post(f'{context.base_url}/completion', json={
"prompt": prompt,
"n_predict": int(n_predict) if n_predict is not None else context.n_predict,
"seed": context.seed
})
assert response.status_code == 200
context.completions.append(response.json())
def request_completion(context, prompt, n_predict=None, user_api_key=None):
origin = "my.super.domain"
headers = {
'Origin': origin
}
if 'user_api_key' in context:
headers['Authorization'] = f'Bearer {user_api_key}'

response = requests.post(f'{context.base_url}/completion',
json={
"prompt": prompt,
"n_predict": int(n_predict) if n_predict is not None else context.n_predict,
"seed": context.seed
},
headers=headers)
if n_predict is not None and n_predict > 0:
assert response.status_code == 200
assert response.headers['Access-Control-Allow-Origin'] == origin
context.completions.append(response.json())
else:
assert response.status_code == 401


def oai_chat_completions(context, user_prompt):

def oai_chat_completions(context, user_prompt, api_error=None):
openai.api_key = context.user_api_key
openai.api_base = f'{context.base_url}/v1/chat'
chat_completion = openai.Completion.create(
messages=[
{
"role": "system",
"content": context.system_prompt,
},
{
"role": "user",
"content": user_prompt,
}
],
model=context.model,
max_tokens=context.n_predict,
stream=context.enable_streaming,
seed=context.seed
)
try:
chat_completion = openai.Completion.create(
messages=[
{
"role": "system",
"content": context.system_prompt,
},
{
"role": "user",
"content": user_prompt,
}
],
model=context.model,
max_tokens=context.n_predict,
stream=context.enable_streaming,
seed=context.seed
)
except openai.error.APIError:
if api_error:
openai.api_key = context.api_key
return
openai.api_key = context.api_key
if context.enable_streaming:
completion_response = {
'content': '',
1 change: 1 addition & 0 deletions examples/server/tests/tests.sh
@@ -29,6 +29,7 @@ set -eu
--threads-batch 4 \
--embedding \
--cont-batching \
--api-key llama.cpp \
"$@" &

# Start tests
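
With --api-key llama.cpp passed above, every test request has to authenticate. A minimal sketch of such a call, assuming the Bearer scheme used in steps.py and the localhost:8080 server from the feature Background:

import requests

# Sketch only: query the locally started test server with the API key
# configured in tests.sh, using the Bearer scheme from steps.py.
response = requests.post(
    'http://localhost:8080/completion',
    json={"prompt": "I believe the meaning of life is", "n_predict": 8, "seed": 42},
    headers={"Authorization": "Bearer llama.cpp"},
)
assert response.status_code == 200
print(response.json()['content'])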