Server: reorganize some http logic #5939
Conversation
// process all prompts
json responses = json::array();
for (auto & prompt : prompts) {
    // TODO @ngxson : maybe support multitask for this endpoint?
@ggerganov Quick question: should we use multitask for multiple input prompts in the embedding endpoint?
Yes, probably a good idea in order to occupy fewer slots during processing
I didn't really like the change to the server logs; they seem too verbose now that they're in JSON format. There should be an option to display them in a more readable way when debugging in the terminal; even finding the entry endpoint is frustrating.
{"function":"initialize","level":"INFO","line":426,"msg":"initializing slots","n_slots":1,"tid":"5612","timestamp":1709914159}
{"function":"initialize","level":"INFO","line":438,"msg":"new slot","n_ctx_slot":512,"slot_id":0,"tid":"5612","timestamp":1709914159}
{"function":"main","level":"INFO","line":3040,"msg":"model loaded","tid":"5612","timestamp":1709914159}
{"function":"main","hostname":"127.0.0.1","level":"INFO","line":3509,"msg":"HTTP server listening","n_threads_http":"11","port":"8080","tid":"5612","timestamp":1709914159}
{"function":"update_slots","level":"INFO","line":1574,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"5612","timestamp":1709914159}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/","remote_addr":"127.0.0.1","remote_port":52626,"status":200,"tid":"12236","timestamp":1709914168}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/index.js","remote_addr":"127.0.0.1","remote_port":52626,"status":200,"tid":"12236","timestamp":1709914168}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/completion.js","remote_addr":"127.0.0.1","remote_port":52627,"status":200,"tid":"22724","timestamp":1709914168}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/json-schema-to-grammar.mjs","remote_addr":"127.0.0.1","remote_port":52628,"status":200,"tid":"22540","timestamp":1709914168}
{"function":"launch_slot_with_data","level":"INFO","line":819,"msg":"slot is processing task","slot_id":0,"task_id":0,"tid":"5612","timestamp":1709914171}
{"function":"update_slots","ga_i":0,"level":"INFO","line":1812,"msg":"slot progression","n_past":0,"n_past_se":0,"n_prompt_tokens_processed":112,"slot_id":0,"task_id":0,"tid":"5612","timestamp":1709914171}
{"function":"update_slots","level":"INFO","line":1836,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":0,"tid":"5612","timestamp":1709914171}
{"function":"print_timings","level":"INFO","line":260,"msg":"prompt eval time = 28783.38 ms / 112 tokens ( 256.99 ms per token, 3.89 tokens per second)","n_prompt_tokens_processed":112,"n_tokens_second":3.891134127323884,"slot_id":0,"t_prompt_processing":28783.382,"t_token":256.9944821428572,"task_id":0,"tid":"5612","timestamp":1709914203}
{"function":"print_timings","level":"INFO","line":274,"msg":"generation eval time = 3490.47 ms / 12 runs ( 290.87 ms per token, 3.44 tokens per second)","n_decoded":12,"n_tokens_second":3.4379334123867027,"slot_id":0,"t_token":290.87241666666665,"t_token_generation":3490.469,"task_id":0,"tid":"5612","timestamp":1709914203}
{"function":"print_timings","level":"INFO","line":283,"msg":" total time = 32273.85 ms","slot_id":0,"t_prompt_processing":28783.382,"t_token_generation":3490.469,"t_total":32273.851000000002,"task_id":0,"tid":"5612","timestamp":1709914203}
{"function":"update_slots","level":"INFO","line":1644,"msg":"slot released","n_cache_tokens":124,"n_ctx":512,"n_past":123,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"5612","timestamp":1709914203,"truncated":false}
{"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/completion","remote_addr":"127.0.0.1","remote_port":52628,"status":200,"tid":"22540","timestamp":1709914203} |
@FSSRepo There's a log format option: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md?plain=1#L60
Nice! The code is getting better and better to work with.
No worries. The test feature is called "security", but it is quite ambitious at this stage.
@ngxson Maybe you should merge it as it is to avoid resolving conflicts, and then implement the embeddings multitask in a separate PR.
@ggerganov OK, I'll merge it now without waiting for the CI. I also ran the tests locally and they pass, so we can be confident that the CI will pass.
* refactor static file handler
* use set_pre_routing_handler for validate_api_key
* merge embedding handlers
* correct http verb for endpoints
* fix embedding response
* fix test case CORS Options
* fix code style
Motivation
From the beginning, server routes have been registered inline, for example:
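A rough sketch of what this inline style looks like, assuming the cpp-httplib `Server` API that the server is built on; the route, payload fields, and handler body here are illustrative, not the actual server code:

```cpp
#include "httplib.h"
#include "json.hpp"

using json = nlohmann::json;

int main() {
    httplib::Server svr;

    // Route and handler declared together, inline (illustrative only).
    svr.Post("/completion", [](const httplib::Request & req, httplib::Response & res) {
        const json body   = json::parse(req.body);               // request payload
        const auto prompt = body.value("prompt", std::string()); // hypothetical field
        const json out    = {{"content", "echo: " + prompt}};    // placeholder response
        res.set_content(out.dump(), "application/json");
    });

    svr.listen("127.0.0.1", 8080);
}
```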
This is OK when we have a small number of routes. But as the server code grows in complexity, we started having more routes, and even multiple routes that share the same handler (like `/chat/completions` and `/v1/chat/completions`).

I propose in this PR to separate the place where we define the handler function from the place where we register the route, inspired by modern web backend frameworks that keep controllers in one place and routes in another. This will make it easier to keep track of the routes as their number grows; a rough sketch of the idea follows.
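For contrast, a minimal sketch of the separated style under the same cpp-httplib assumption; the handler names and bodies are hypothetical, only the route paths come from the discussion above:

```cpp
#include "httplib.h"
#include "json.hpp"

using json = nlohmann::json;

int main() {
    httplib::Server svr;

    // Handlers ("controllers") defined in one place...
    const auto handle_chat_completions = [](const httplib::Request &, httplib::Response & res) {
        const json out = {{"choices", json::array()}};   // placeholder body
        res.set_content(out.dump(), "application/json");
    };
    const auto handle_embeddings = [](const httplib::Request &, httplib::Response & res) {
        const json out = {{"embedding", json::array()}}; // placeholder body
        res.set_content(out.dump(), "application/json");
    };

    // ...routes registered in another, so aliased paths share one handler.
    svr.Post("/chat/completions",    handle_chat_completions);
    svr.Post("/v1/chat/completions", handle_chat_completions);
    svr.Post("/embedding",           handle_embeddings);
    svr.Post("/v1/embeddings",       handle_embeddings);

    svr.listen("127.0.0.1", 8080);
}
```

One nice side effect of this layout is that adding or renaming an alias is a one-line change in the routing section, without touching the handler code.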
Here is a list of routes that we currently support:
In this PR
* Turn `validate_api_key` into a real middleware (using `set_pre_routing_handler`); see the sketch after this list
* Merge the `embedding` and `/v1/embeddings` handlers
* De-duplicate some code in `chunked_content_provider` ==> do it later, to avoid conflicts
* Move `json::parse` to a middleware ==> will see later in another PR; I don't know yet how to do it, or whether it will be worth the effort
* Also fix a bug when the API key has fewer than 4 characters
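To illustrate the middleware point, here is a hedged sketch of an API-key check done as a pre-routing handler with cpp-httplib's `set_pre_routing_handler`; the header scheme, key value, and `/health` route are assumptions for the example, not the PR's actual implementation:

```cpp
#include "httplib.h"

int main() {
    httplib::Server svr;
    const std::string api_key = "sk-test"; // illustrative only

    // Pre-routing middleware: runs before any route handler is dispatched.
    svr.set_pre_routing_handler([&](const httplib::Request & req, httplib::Response & res) {
        const std::string auth = req.get_header_value("Authorization");
        if (auth == "Bearer " + api_key) {
            return httplib::Server::HandlerResponse::Unhandled; // let routing continue
        }
        res.status = 401;
        res.set_content("{\"error\":\"invalid api key\"}", "application/json");
        return httplib::Server::HandlerResponse::Handled;       // stop the request here
    });

    svr.Get("/health", [](const httplib::Request &, httplib::Response & res) {
        res.set_content("{\"status\":\"ok\"}", "application/json");
    });

    svr.listen("127.0.0.1", 8080);
}
```

Returning `Unhandled` lets normal routing continue, while `Handled` short-circuits the request; that is what makes the pre-routing handler behave like middleware rather than an extra check duplicated in every endpoint.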