
Server: format error to json #5961

Merged: ngxson merged 13 commits into ggerganov:master on Mar 11, 2024

Conversation

ngxson (Collaborator) commented on Mar 9, 2024:

This PR migrates some changes from #5776 and #5710.

Errors are now formatted to JSON as suggested in #5882 (comment)

Examples

Unhandled exceptions (for example, invalid input JSON)

{
    "error": {
        "code": 500,
        "message": "[json.exception.parse_error.101] parse error at line 3, column 20: syntax error while parsing object - unexpected ':'; expected '}'",
        "type": "server_error"
    }
}

404 not found

{
    "error": {
        "code": 404,
        "message": "File Not Found",
        "type": "not_found_error"
    }
}

API key incorrect

{
    "error": {
        "code": 401,
        "message": "Invalid API Key",
        "type": "authentication_error"
    }
}

Endpoint is disabled

{
    "error": {
        "code": 501,
        "message": "This server does not support metrics endpoint.",
        "type": "not_supported_error"
    }
}

Invalid grammar

{
    "error": {
        "code": 400,
        "message": "Failed to parse grammar",
        "type": "invalid_request_error"
    }
}

Missing both "input" and "content" for /embeddings

{
    "error": {
        "code": 400,
        "message": "\"input\" or \"content\" must be provided",
        "type": "invalid_request_error"
    }
}

Missing both "prompt" and "messages" for /completions

{
    "error": {
        "code": 400,
        "message": "Either \"prompt\" or \"messages\" must be provided",
        "type": "invalid_request_error"
    }
}

NOTE:

  • Empty prompt for completions / embeddings is not considered an error (since empty prompts will still have a BOS token by default)
  • The error codes are still somewhat inconsistent (i.e. sometimes 400 would be more appropriate than 500), but for now I don't have a better way to receive errors from inside run_slots(). This may be fixed in the future.
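
For reference, here is a minimal sketch of how a response of this shape could be assembled with nlohmann::json (the JSON library the server already uses). The enum and function names below are illustrative only, not the actual identifiers in examples/server/utils.hpp:

#include <string>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Hypothetical error categories mirroring the "type" strings in the examples above.
enum error_type {
    ERROR_TYPE_INVALID_REQUEST, // 400
    ERROR_TYPE_AUTHENTICATION,  // 401
    ERROR_TYPE_NOT_FOUND,       // 404
    ERROR_TYPE_SERVER,          // 500
    ERROR_TYPE_NOT_SUPPORTED,   // 501
};

static json format_error_sketch(const std::string & message, error_type type) {
    std::string type_str = "server_error";
    int code = 500;
    switch (type) {
        case ERROR_TYPE_INVALID_REQUEST: type_str = "invalid_request_error"; code = 400; break;
        case ERROR_TYPE_AUTHENTICATION:  type_str = "authentication_error";  code = 401; break;
        case ERROR_TYPE_NOT_FOUND:       type_str = "not_found_error";       code = 404; break;
        case ERROR_TYPE_SERVER:          type_str = "server_error";          code = 500; break;
        case ERROR_TYPE_NOT_SUPPORTED:   type_str = "not_supported_error";   code = 501; break;
    }
    // Produces the shape shown in the examples: {"error": {"code", "message", "type"}}
    return json {
        {"error", json {
            {"code",    code},
            {"message", message},
            {"type",    type_str},
        }},
    };
}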

ngxson marked this pull request as ready for review and requested review from phymbert and ggerganov on March 9, 2024 at 14:58.

ngxson (Collaborator, Author) commented on Mar 9, 2024:

Small note for @phymbert: since we're using the correct OpenAI format now, the Python test code will need to use the corresponding exception classes. No action needed from your side right now, just a reminder.

(resolved review threads on examples/server/utils.hpp and examples/server/server.cpp)

z80maniac (Contributor) commented, quoting the note above:

> Empty prompt for completions / embeddings is not considered an error (since empty prompts will still have BOS token by default)

A BOS token is only added when a set of conditions is satisfied. From the docs:

 A BOS token is inserted at the start, if all of the following conditions are true:

- The prompt is a string or an array with the first element given as a string
- The model's `tokenizer.ggml.add_bos_token` metadata is `true`
- The system prompt is empty

For example, as I showed here, an empty prompt specified as an array always returns an empty response:

❯ curl -sS --data '{"prompt": [], "n_predict": 4}' http://127.0.0.1:8080/completion | jq .content
""

I think it's counter-intuitive, and instead an error should be returned in this case.
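
For context, the three documented conditions translate to roughly the following check. This is a hedged sketch, not the server's actual tokenization code; the helper name and its parameters are hypothetical:

#include <string>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Hypothetical helper: true only when all three documented conditions hold.
static bool should_insert_bos(const json & prompt, bool add_bos_token_metadata, const std::string & system_prompt) {
    const bool prompt_is_string_like =
        prompt.is_string() ||
        (prompt.is_array() && !prompt.empty() && prompt[0].is_string());

    return prompt_is_string_like      // prompt is a string, or an array whose first element is a string
        && add_bos_token_metadata     // the model's tokenizer.ggml.add_bos_token metadata is true
        && system_prompt.empty();     // the system prompt is empty
}

Under this reading, an empty array prompt fails the first condition, so no BOS token is inserted and the completion comes back empty, which matches the curl output above.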

z80maniac (Contributor) commented:

This PR breaks the web UI error handling, so I think all the web UI changes from my PR (in completion.js and completion.js.hpp) can be moved here, since the error format remained the same.

Also, the README part can be taken from there as well, though the formatting there is a bit confusing (the "where" list blends into the "POST" section below), so it may need to be done differently.

ngxson (Collaborator, Author) commented on Mar 9, 2024:

@z80maniac Thanks for the suggestions, I took the completion.js file (and its .hpp) from your PR.

For the README, I added a dedicated section to specify that we use the same format as OpenAI. Only llama.cpp-specific errors (for example: /slots, /metrics, grammar) will be documented there.

(resolved review thread on examples/server/utils.hpp)
@@ -1580,7 +1595,7 @@ struct server_context {
         queue_results.send(result);
     }

-    bool update_slots() {
+    void run_slots() {
Collaborator commented:

IMHO update_slots was clearer for me: the slots are not run, but updated from the batch view.

Collaborator commented:

We should not get rid of the return type. Returning false might interrupt the main loop, for example.

ngxson (Collaborator, Author) commented:

For now, the main loop only stops if running is set to false, so if the loop stops we know exactly why.

Giving update_slots the ability to exit the main loop may make it harder to debug if there is an error inside update_slots.
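
To illustrate the argument, here is a rough, self-contained sketch of the control flow being discussed. The helper names and loop body are simplified assumptions, not the actual server.cpp code:

#include <atomic>

// Hypothetical stand-ins for the server's task processing and slot update step.
static void process_pending_tasks() { /* dequeue and dispatch client requests (stub) */ }
static void run_slots()             { /* advance each active slot by one step (stub) */ }

static std::atomic<bool> running {true};

static void server_main_loop() {
    while (running) {
        process_pending_tasks(); // handle queued requests
        run_slots();             // void: a failure inside is reported to the client
                                 // as a JSON error and never terminates this loop
    }
}

In other words, the only way the loop ends is by clearing running (e.g. on shutdown), so a bool return from run_slots() would add a second, less obvious exit path.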

@@ -2035,20 +2053,20 @@ struct server_context {
                 n_batch /= 2;
                 i -= n_batch;

-                continue;
+                continue; // continue loop of n_batch
Collaborator commented:

I suggest // continue batch decoding loop

ngxson (Collaborator, Author) commented Mar 9, 2024:

In fact, the loop retries llama_decode with a smaller n_batch until it succeeds.

For now I don't know how to rephrase the idea, so I just left the comment this way to distinguish it from the continue and break statements below. Since we're inside a nested loop here, I think having comments makes it easier to navigate.

Or maybe we don't need these comments at all. Do you have any input on that, @ggerganov?
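
For readers following the thread, here is a small, self-contained toy simulation of the batch-halving retry that this comment refers to. decode() is a stand-in for llama_decode (it simply refuses batches larger than 8 tokens) and the numbers are arbitrary; only the n_batch /= 2; i -= n_batch; continue; mechanic mirrors the real loop:

#include <algorithm>
#include <cstdio>

// Stand-in for llama_decode: pretend decoding fails for batches larger than 8 tokens.
static bool decode(int n_tokens) { return n_tokens <= 8; }

int main() {
    const int n_total = 40; // tokens queued in the batch
    int n_batch = 32;       // current batch size

    for (int i = 0; i < n_total; i += n_batch) {
        const int n_tokens = std::min(n_batch, n_total - i);

        if (!decode(n_tokens)) {
            if (n_batch == 1) {
                std::printf("decode failed even with n_batch = 1, giving up\n");
                break;
            }
            // retry the same span of tokens with half the batch size
            n_batch /= 2;
            i -= n_batch; // compensate for the upcoming i += n_batch
            continue;     // continue loop of n_batch
        }

        std::printf("decoded %d tokens at offset %d (n_batch = %d)\n", n_tokens, i, n_batch);
    }
    return 0;
}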

ngxson (Collaborator, Author) commented:

I decided to keep the comment this way, because the goal is to reflect the variable name, which helps when navigating a nested loop (not to explain what the code does).

If you're OK with that, then I'll merge this PR @phymbert

             }

             for (auto & slot : slots) {
                 if (slot.state != SLOT_STATE_PROCESSING || slot.i_batch < (int) i || slot.i_batch >= (int) (i + n_tokens)) {
-                    continue;
+                    continue; // continue loop of slots
Collaborator commented:

I suggest // continue updating next token of next slots or something similar

(resolved review thread on examples/server/README.md)
z80maniac (Contributor) commented, quoting an earlier reply:

> since the error format remained the same.

Turns out that while the error format did not change, its delivery to the web UI did: the error response is no longer wrapped in a content field. Not sure if it's supposed to be this way, but anyway, here's a diff that fixes the web UI and also adds support for the code field: webui_fix.txt (can't attach *.diff)

ngxson (Collaborator, Author) commented Mar 10, 2024:

@z80maniac Thanks, it's updated in 0e72879

ngxson merged commit caa106d into ggerganov:master on Mar 11, 2024. 60 checks passed.
NeoZhangJianyu pushed a commit to NeoZhangJianyu/llama.cpp that referenced this pull request Mar 12, 2024
* server: format error to json

* server: do not crash on grammar error

* fix api key test case

* revert limit max n_predict

* small fix

* correct coding style

* update completion.js

* launch_slot_with_task

* update docs

* update_slots

* update webui

* update readme
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024