Conversation

@ngxson (Collaborator) commented May 9, 2025

This PR acts as a PoC to illustrate my idea in #13367

It works by spawning an "interim" server that exposes a /load endpoint.

For example:

# run server without specifying model
llama-server

# then, load it via API
curl --header "Content-Type: application/json" \
  --request POST \
  --data '{"hf_repo": "ggml-org/gemma-3-4b-it-GGUF"}' \
  http://localhost:8080/load
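
For concreteness, a minimal sketch of what the interim server's /load handler could look like, assuming cpp-httplib and nlohmann::json (both already vendored by llama.cpp); load_model_from_hf is a hypothetical placeholder for the actual loading path:

#include "httplib.h"          // cpp-httplib, vendored by llama.cpp
#include <nlohmann/json.hpp>  // vendored as json.hpp

using json = nlohmann::json;

// hypothetical helper: fetch the model and hand off to the main server
static bool load_model_from_hf(const std::string & hf_repo);

int main() {
    httplib::Server svr;

    // POST /load {"hf_repo": "..."} -> load the requested model
    svr.Post("/load", [](const httplib::Request & req, httplib::Response & res) {
        json body = json::parse(req.body, nullptr, /* allow_exceptions */ false);
        if (body.is_discarded() || !body.contains("hf_repo")) {
            res.status = 400;
            res.set_content(R"({"error":"missing hf_repo"})", "application/json");
            return;
        }
        const bool ok = load_model_from_hf(body["hf_repo"].get<std::string>());
        res.status = ok ? 200 : 500;
    });

    svr.listen("0.0.0.0", 8080);
}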

The implementation separates run_interim_server and run_main_server because run_main_server could later be converted to spawn a child process, though I'm not sure that's the preferable way to go.
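
A rough sketch of that child-process variant, assuming POSIX fork/exec (the internal port number is arbitrary; the interim server would proxy to it):

#include <string>
#include <unistd.h>     // fork, execlp, _exit

// hypothetical: spawn a full llama-server as a child on an internal port
static pid_t spawn_main_server(const std::string & hf_repo) {
    pid_t pid = fork();
    if (pid == 0) {
        // child: replace this process with the real server
        execlp("llama-server", "llama-server",
               "-hf", hf_repo.c_str(),
               "--port", "8081",        // internal port behind the interim server
               (char *) nullptr);
        _exit(1);                       // only reached if exec failed
    }
    return pid; // parent keeps the pid so it can stop/swap the model later
}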

WDYT about this approach @ggerganov @slaren ?

@ngxson marked this pull request as draft on May 9, 2025, 09:21
@ggerganov (Member)

Nice.

Maybe the interim API should also have logic to route main API requests to the respective server based on the model id. This way, 3rd-party apps can always communicate through a single network port.
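
One possible shape for that routing layer, sketched with cpp-httplib's client; the model→port registry and port numbers are made up for illustration:

#include "httplib.h"
#include <nlohmann/json.hpp>
#include <map>
#include <string>

using json = nlohmann::json;

// hypothetical registry: model id -> internal port of the server holding it
static std::map<std::string, int> model_ports = {
    { "gemma-3-4b-it-GGUF", 8081 },
};

// forward an OAI-compatible request to the server that owns the model
static void route_chat(const httplib::Request & req, httplib::Response & res) {
    json body = json::parse(req.body, nullptr, /* allow_exceptions */ false);
    auto it = body.is_object() && body.contains("model")
            ? model_ports.find(body["model"].get<std::string>())
            : model_ports.end();
    if (it == model_ports.end()) {
        res.status = 404;
        return;
    }
    httplib::Client upstream("127.0.0.1", it->second);
    if (auto r = upstream.Post("/v1/chat/completions", req.body, "application/json")) {
        res.status = r->status;
        res.set_content(r->body, "application/json");
    } else {
        res.status = 502; // upstream unreachable
    }
}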

@ngxson (Collaborator, Author) commented May 9, 2025

Yes, that could be a good idea. I'm thinking about abstracting out the HTTP server implementation so we can implement the routing logic more easily.

In any case, I think separating the HTTP layer from the handler code will be one of our main goals in the very short term, before we can do anything else. The problem is that server.cpp currently takes 30 seconds to compile, which makes development not very pleasant 😂
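
As an illustration of that separation, the handler code could be written against plain request/response structs, with the HTTP library kept behind a thin adapter; all names here are hypothetical:

#include <functional>
#include <map>
#include <string>

// HTTP-library-agnostic request/response types
struct server_request  { std::string path; std::string body; };
struct server_response { int status = 200;  std::string body; };

using handler_fn = std::function<server_response(const server_request &)>;

// the HTTP layer (httplib today, anything else tomorrow) only sees this table
static std::map<std::string, handler_fn> routes;

// handlers are registered without ever touching httplib types
static void register_handlers() {
    routes["/load"] = [](const server_request & req) {
        return server_response { 200, R"({"status":"ok"})" };
    };
}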

@isaac-mcfadyen (Contributor)

> Maybe the interim API should also have logic to route main API requests to the respective server based on the model id. This way, 3rd-party apps can always communicate through a single network port.

In case it helps: llama-swap (a third-party tool with a similar idea) does this by adding an endpoint that "passes through" the request to the model named in the path.

E.g., to route to the model with ID gemma-3-4b-it-GGUF (loading it if needed):

curl -X POST http://127.0.0.1:8080/upstream/gemma-3-4b-it-GGUF/v1/chat/completions # etc...
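
That pattern maps naturally onto cpp-httplib's regex routes; a sketch, reusing the hypothetical model_ports registry from above:

// route /upstream/<model>/<rest> to the internal server that owns <model>;
// cpp-httplib exposes regex capture groups via req.matches
svr.Post(R"(/upstream/([^/]+)/(.*))", [](const httplib::Request & req, httplib::Response & res) {
    const std::string model = req.matches[1];
    const std::string path  = "/" + std::string(req.matches[2]);
    auto it = model_ports.find(model);   // hypothetical registry, as above
    if (it == model_ports.end()) {       // a real version could trigger /load here
        res.status = 404;
        return;
    }
    httplib::Client upstream("127.0.0.1", it->second);
    if (auto r = upstream.Post(path, req.body, "application/json")) {
        res.status = r->status;
        res.set_content(r->body, "application/json");
    } else {
        res.status = 502;
    }
});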
