
Server: reorganize some http logic #5939

Merged: 9 commits, Mar 9, 2024

Conversation

ngxson (Collaborator) commented Mar 8, 2024

Motivation

From the beginning, server routes have been registered inline, for example:

svr.Post("/completions", [&ctx_server](const httplib::Request & req, httplib::Response & res) {
   ...
});

This is OK when we have a small number of routes, but as the server code grows in complexity, we have added more routes, including multiple routes that share the same handler (like /chat/completions and /v1/chat/completions).

In this PR, I propose separating the handler definitions from the route registrations. This is inspired by modern web backend frameworks, which keep controllers in one place and routes in another.

This will make it easier to keep track of the routes as the server grows.

Here is a list of routes that we currently support:

    svr.Get("/", handle_static_file(index_html, index_html_len, "text/html; charset=utf-8"));
    svr.Get("/index.js", handle_static_file(index_js, index_js_len, "text/javascript; charset=utf-8"));
    svr.Get("/completion.js", handle_static_file(completion_js, completion_js_len, "text/javascript; charset=utf-8"));
    svr.Get("/json-schema-to-grammar.mjs", handle_static_file(
        json_schema_to_grammar_mjs, json_schema_to_grammar_mjs_len, "text/javascript; charset=utf-8"));

    // register API routes
    svr.Get ("/health",              handle_health);
    svr.Get ("/slots",               handle_slots);
    svr.Get ("/metrics",             handle_metrics);
    svr.Get ("/props",               handle_props);
    svr.Get ("/v1/models",           handle_models);
    svr.Post("/completion",          handle_completions); // legacy
    svr.Post("/completions",         handle_completions);
    svr.Post("/v1/completions",      handle_completions);
    svr.Post("/chat/completions",    handle_chat_completions);
    svr.Post("/v1/chat/completions", handle_chat_completions);
    svr.Post("/infill",              handle_infill);
    svr.Post("/embedding",           handle_embeddings); // legacy
    svr.Post("/embeddings",          handle_embeddings);
    svr.Post("/v1/embeddings",       handle_embeddings);
    svr.Post("/tokenize",            handle_tokenize);
    svr.Post("/detokenize",          handle_detokenize);
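The separation can be sketched in isolation as follows. This is an illustrative sketch only: `Request`, `Response`, `Handler`, and `make_post_routes` are stand-in types and helpers I made up for the example, not the real cpp-httplib API or the PR's actual code.

```cpp
#include <functional>
#include <map>
#include <string>

// Stand-ins for httplib::Request / httplib::Response (illustrative only).
struct Request  { std::string body; };
struct Response { int status = 200; std::string body; };

using Handler = std::function<void(const Request &, Response &)>;

// 1) Handlers are defined in one place...
static const Handler handle_health = [](const Request &, Response & res) {
    res.body = "{\"status\":\"ok\"}";
};
static const Handler handle_completions = [](const Request & req, Response & res) {
    res.body = "completion for: " + req.body;
};

// 2) ...and routes are registered in another, so aliases like
//    /completions and /v1/completions share a single handler.
std::map<std::string, Handler> make_post_routes() {
    std::map<std::string, Handler> routes;
    routes["/completion"]     = handle_completions; // legacy alias
    routes["/completions"]    = handle_completions;
    routes["/v1/completions"] = handle_completions;
    return routes;
}
```

The point of the pattern is that adding an alias is one extra registration line, with no duplicated handler body.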

In this PR

  • Disable the static assets path by default. All the JS/HTML files are already embedded in the compiled binary, and leaving the static path enabled by default may be a security hazard ==> it's now disabled by default
  • Move validate_api_key to a real middleware (using set_pre_routing_handler)
  • Break the HTTP code into dedicated sections: Middlewares - Handlers - Routes
  • Merge the /embedding and /v1/embeddings handlers
  • De-duplicate some code in chunked_content_provider ==> do it later, to avoid conflicts
  • Move json::parse to a middleware ==> not sure yet how to do it, or whether it's worth the effort ==> will revisit in another PR

This PR also fixes a bug that occurs when the API key has fewer than 4 characters.
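As an illustration of the API-key middleware idea, here is a self-contained sketch. Both `validate_api_key` and `redact` are hypothetical helpers written for this example, not the PR's actual code; in the real server such a check would be wired through httplib's `set_pre_routing_handler`. The guess that the short-key bug involves an unchecked fixed-length substring is my assumption, illustrated by the bounded `substr` in `redact`.

```cpp
#include <algorithm>
#include <string>

// Hypothetical check: returns true if the request may proceed.
// `api_key` is the configured server key; `auth_header` is the raw
// value of the "Authorization" request header.
bool validate_api_key(const std::string & api_key, const std::string & auth_header) {
    if (api_key.empty()) {
        return true; // no key configured: open access
    }
    const std::string prefix = "Bearer ";
    if (auth_header.compare(0, prefix.size(), prefix) != 0) {
        return false; // missing or malformed Authorization header
    }
    return auth_header.substr(prefix.size()) == api_key;
}

// Hypothetical log redaction: clamping the substring length avoids
// misbehavior when the key is shorter than 4 characters (assumption:
// this is the kind of pitfall the short-key bug fix addresses).
std::string redact(const std::string & api_key) {
    return api_key.substr(0, std::min<size_t>(4, api_key.size())) + "...";
}
```

A pre-routing middleware would call `validate_api_key` once, before dispatch, instead of repeating the check in every handler.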

@ngxson ngxson marked this pull request as draft March 8, 2024 11:31
@ngxson ngxson marked this pull request as ready for review March 8, 2024 14:00
@ngxson ngxson requested review from ggerganov and phymbert March 8, 2024 14:00
// process all prompts
json responses = json::array();
for (auto & prompt : prompts) {
// TODO @ngxson : maybe support multitask for this endpoint?
ngxson (Collaborator, Author) commented:
@ggerganov Quick question: should we use multitask for multiple input prompts in the embedding endpoint?

ggerganov (Owner) replied:

Yes, probably a good idea in order to occupy fewer slots during processing
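A hypothetical sketch of what "multitask" could mean here: splitting a multi-prompt embedding request into one task per prompt, so the whole request can be queued as a single batch rather than looping over prompts sequentially. The `EmbeddingTask` type and `split_into_tasks` helper are illustrative assumptions, not code from the repository.

```cpp
#include <string>
#include <vector>

// Illustrative task unit (assumption, not the server's real task type).
struct EmbeddingTask {
    int         id;     // position of the prompt in the request
    std::string prompt; // text to embed
};

// Split a multi-prompt request into one task per prompt; the scheduler
// could then process them as a batch and gather results by id.
std::vector<EmbeddingTask> split_into_tasks(const std::vector<std::string> & prompts) {
    std::vector<EmbeddingTask> tasks;
    tasks.reserve(prompts.size());
    for (size_t i = 0; i < prompts.size(); ++i) {
        tasks.push_back({static_cast<int>(i), prompts[i]});
    }
    return tasks;
}
```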

@ngxson ngxson changed the title Server: de-duplicate some http logic Server: reorganize some http logic Mar 8, 2024
FSSRepo (Collaborator) commented Mar 8, 2024

I didn't really like the change in the server logs; they seem too verbose now that they're in JSON format. There should be an option to display them in a more understandable way when debugging in the terminal; even finding the entry endpoint is frustrating.

{"function":"initialize","level":"INFO","line":426,"msg":"initializing slots","n_slots":1,"tid":"5612","timestamp":1709914159}
{"function":"initialize","level":"INFO","line":438,"msg":"new slot","n_ctx_slot":512,"slot_id":0,"tid":"5612","timestamp":1709914159}
{"function":"main","level":"INFO","line":3040,"msg":"model loaded","tid":"5612","timestamp":1709914159}
{"function":"main","hostname":"127.0.0.1","level":"INFO","line":3509,"msg":"HTTP server listening","n_threads_http":"11","port":"8080","tid":"5612","timestamp":1709914159}
{"function":"update_slots","level":"INFO","line":1574,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"5612","timestamp":1709914159}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/","remote_addr":"127.0.0.1","remote_port":52626,"status":200,"tid":"12236","timestamp":1709914168}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/index.js","remote_addr":"127.0.0.1","remote_port":52626,"status":200,"tid":"12236","timestamp":1709914168}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/completion.js","remote_addr":"127.0.0.1","remote_port":52627,"status":200,"tid":"22724","timestamp":1709914168}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/json-schema-to-grammar.mjs","remote_addr":"127.0.0.1","remote_port":52628,"status":200,"tid":"22540","timestamp":1709914168}
{"function":"launch_slot_with_data","level":"INFO","line":819,"msg":"slot is processing task","slot_id":0,"task_id":0,"tid":"5612","timestamp":1709914171}
{"function":"update_slots","ga_i":0,"level":"INFO","line":1812,"msg":"slot progression","n_past":0,"n_past_se":0,"n_prompt_tokens_processed":112,"slot_id":0,"task_id":0,"tid":"5612","timestamp":1709914171}
{"function":"update_slots","level":"INFO","line":1836,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":0,"tid":"5612","timestamp":1709914171}
{"function":"print_timings","level":"INFO","line":260,"msg":"prompt eval time     =   28783.38 ms /   112 tokens (  256.99 ms per token,     3.89 tokens per second)","n_prompt_tokens_processed":112,"n_tokens_second":3.891134127323884,"slot_id":0,"t_prompt_processing":28783.382,"t_token":256.9944821428572,"task_id":0,"tid":"5612","timestamp":1709914203}
{"function":"print_timings","level":"INFO","line":274,"msg":"generation eval time =    3490.47 ms /    12 runs   (  290.87 ms per token,     3.44 tokens per second)","n_decoded":12,"n_tokens_second":3.4379334123867027,"slot_id":0,"t_token":290.87241666666665,"t_token_generation":3490.469,"task_id":0,"tid":"5612","timestamp":1709914203}
{"function":"print_timings","level":"INFO","line":283,"msg":"          total time =   32273.85 ms","slot_id":0,"t_prompt_processing":28783.382,"t_token_generation":3490.469,"t_total":32273.851000000002,"task_id":0,"tid":"5612","timestamp":1709914203}
{"function":"update_slots","level":"INFO","line":1644,"msg":"slot released","n_cache_tokens":124,"n_ctx":512,"n_past":123,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"5612","timestamp":1709914203,"truncated":false}
{"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/completion","remote_addr":"127.0.0.1","remote_port":52628,"status":200,"tid":"22540","timestamp":1709914203}

ngxson (Collaborator, Author) commented Mar 8, 2024

@FSSRepo There's log format option: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md?plain=1#L60

phymbert (Collaborator) left a comment:

Nice! The code becomes better and better to work with

phymbert (Collaborator) commented Mar 8, 2024

@ngxson mind that the test security.feature looks broken

ngxson (Collaborator, Author) commented Mar 8, 2024

@phymbert Thanks for noticing. I believe I accidentally found a bug here: the server in the test case is launched with an API key, but the scenario doesn't use that API key. I needed to modify both the cpp code and the test script to make it work. Could you please have a look? Thanks! 1a7c5fd

phymbert (Collaborator) commented Mar 8, 2024

> @phymbert Thanks for noticing. I believe I accidentally found a bug here: The server in the test case is launched with an API key, but the scenario doesn't use that API key. I needed to modify both cpp code + test script to make it work. Could you please have a look? Thanks! 1a7c5fd

no worries, the test feature is called security, but it is quite ambitious at this stage.

ggerganov (Owner) commented:
@ngxson Maybe you should merge it as it is to avoid resolving conflicts and then implement the embeddings multitask in a separate PR

ngxson (Collaborator, Author) commented Mar 9, 2024

@ggerganov OK, I'll merge it now without waiting for the CI.

I also ran the tests locally and they pass, so we can be confident that the CI will pass.

@ngxson ngxson merged commit 950ba1a into ggerganov:master Mar 9, 2024
54 of 59 checks passed
hazelnutcloud pushed a commit to hazelnutcloud/llama.cpp that referenced this pull request Mar 10, 2024
* refactor static file handler

* use set_pre_routing_handler for validate_api_key

* merge embedding handlers

* correct http verb for endpoints

* fix embedding response

* fix test case CORS Options

* fix code style
NeoZhangJianyu pushed a commit to NeoZhangJianyu/llama.cpp that referenced this pull request Mar 12, 2024
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
4 participants