Conversation

@ngxson (Collaborator) commented May 9, 2025

This PR acts as a PoC to illustrate my idea in #13367

It works by spawning an "interim" server that exposes a /load endpoint.

For example:

# run server without specifying model
llama-server

# then, load it via API
curl --header "Content-Type: application/json" \
  --request POST \
  --data '{"hf_repo": "ggml-org/gemma-3-4b-it-GGUF"}' \
  http://localhost:8080/load
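
For concreteness, a minimal sketch of what the interim server's /load handler could look like, assuming cpp-httplib and nlohmann::json (both already vendored by llama.cpp); load_model_from_hf is a hypothetical placeholder for the actual loading path:

#include "httplib.h"          // cpp-httplib, vendored by llama.cpp
#include <nlohmann/json.hpp>  // vendored as json.hpp

using json = nlohmann::json;

// hypothetical helper: fetch the model and hand off to the main server
static bool load_model_from_hf(const std::string & hf_repo);

int main() {
    httplib::Server svr;

    // POST /load {"hf_repo": "..."} -> load the requested model
    svr.Post("/load", [](const httplib::Request & req, httplib::Response & res) {
        json body = json::parse(req.body, nullptr, /* allow_exceptions */ false);
        if (body.is_discarded() || !body.contains("hf_repo")) {
            res.status = 400;
            res.set_content(R"({"error":"missing hf_repo"})", "application/json");
            return;
        }
        const bool ok = load_model_from_hf(body["hf_repo"].get<std::string>());
        res.status = ok ? 200 : 500;
    });

    svr.listen("0.0.0.0", 8080);
}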

The implementation separates run_interim_server and run_main_server because run_main_server could later be converted to spawn a child process, though I'm not sure that's the preferable way to go.
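
A rough sketch of that child-process variant, assuming POSIX fork/exec (the internal port number is arbitrary; the interim server would proxy to it):

#include <string>
#include <unistd.h>     // fork, execlp, _exit

// hypothetical: spawn a full llama-server as a child on an internal port
static pid_t spawn_main_server(const std::string & hf_repo) {
    pid_t pid = fork();
    if (pid == 0) {
        // child: replace this process with the real server
        execlp("llama-server", "llama-server",
               "-hf", hf_repo.c_str(),
               "--port", "8081",        // internal port behind the interim server
               (char *) nullptr);
        _exit(1);                       // only reached if exec failed
    }
    return pid; // parent keeps the pid so it can stop/swap the model later
}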

WDYT about this approach @ggerganov @slaren ?

@ngxson marked this pull request as draft on May 9, 2025, 09:21
@ggerganov (Member)

Nice.

Maybe the interim API should also have logic to route main API requests to the respective server based on the model id. This way, 3rd-party apps can always communicate through a single network port.
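
One possible shape for that routing layer, sketched with cpp-httplib's client; the model→port registry and port numbers are made up for illustration:

#include "httplib.h"
#include <nlohmann/json.hpp>
#include <map>
#include <string>

using json = nlohmann::json;

// hypothetical registry: model id -> internal port of the server holding it
static std::map<std::string, int> model_ports = {
    { "gemma-3-4b-it-GGUF", 8081 },
};

// forward an OAI-compatible request to the server that owns the model
static void route_chat(const httplib::Request & req, httplib::Response & res) {
    json body = json::parse(req.body, nullptr, /* allow_exceptions */ false);
    auto it = body.is_object() && body.contains("model")
            ? model_ports.find(body["model"].get<std::string>())
            : model_ports.end();
    if (it == model_ports.end()) {
        res.status = 404;
        return;
    }
    httplib::Client upstream("127.0.0.1", it->second);
    if (auto r = upstream.Post("/v1/chat/completions", req.body, "application/json")) {
        res.status = r->status;
        res.set_content(r->body, "application/json");
    } else {
        res.status = 502; // upstream unreachable
    }
}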

@ngxson (Collaborator, Author) commented May 9, 2025

Yes, that could be a good idea. I'm thinking about abstracting out the HTTP server implementation so we can implement the routing logic more easily.

In any case, I think separating the HTTP layer from the handler code will be one of our main goals in the very short term, before we can do anything else. The problem is that server.cpp currently takes 30 seconds to compile, which makes development not very pleasant 😂
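
As an illustration of that separation, the handler code could be written against plain request/response structs, with the HTTP library kept behind a thin adapter; all names here are hypothetical:

#include <functional>
#include <map>
#include <string>

// HTTP-library-agnostic request/response types
struct server_request  { std::string path; std::string body; };
struct server_response { int status = 200;  std::string body; };

using handler_fn = std::function<server_response(const server_request &)>;

// the HTTP layer (httplib today, anything else tomorrow) only sees this table
static std::map<std::string, handler_fn> routes;

// handlers are registered without ever touching httplib types
static void register_handlers() {
    routes["/load"] = [](const server_request & req) {
        return server_response { 200, R"({"status":"ok"})" };
    };
}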

@isaac-mcfadyen (Contributor)

> Maybe the interim API should also have logic to route main API requests to the respective server based on the model id. This way, 3rd-party apps can always communicate through a single network port.

In case it helps: llama-swap (a third-party tool with a similar idea) does this by adding an endpoint that "passes through" the request to the model named in the path.

E.g., to route to the model with ID gemma-3-4b-it-GGUF (loading it if needed):

curl -X POST http://127.0.0.1:8080/upstream/gemma-3-4b-it-GGUF/v1/chat/completions # etc...
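
That pattern maps naturally onto cpp-httplib's regex routes; a sketch, reusing the hypothetical model_ports registry from above:

// route /upstream/<model>/<rest> to the internal server that owns <model>;
// cpp-httplib exposes regex capture groups via req.matches
svr.Post(R"(/upstream/([^/]+)/(.*))", [](const httplib::Request & req, httplib::Response & res) {
    const std::string model = req.matches[1];
    const std::string path  = "/" + std::string(req.matches[2]);
    auto it = model_ports.find(model);   // hypothetical registry, as above
    if (it == model_ports.end()) {       // a real version could trigger /load here
        res.status = 404;
        return;
    }
    httplib::Client upstream("127.0.0.1", it->second);
    if (auto r = upstream.Post(path, req.body, "application/json")) {
        res.status = r->status;
        res.set_content(r->body, "application/json");
    } else {
        res.status = 502;
    }
});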
