PL-134151 Move to llama cpp inference #10

Draft

elikoga wants to merge 123 commits into main from move-to-llama-cpp-inference

Member

elikoga commented Nov 26, 2025

PL-134151

elikoga added 2 commits

November 26, 2025 01:08


          add inference module, entrypoint

8f1cee1


          feat: implement model management with download and load into gpu

5f71d8a

ctheune reviewed

src/skvaider/inference/manager.py (outdated, resolved)

ctheune reviewed

src/inference/manager.py (outdated, resolved)

ctheune reviewed

src/skvaider/inference/manager.py (outdated, resolved)

ctheune reviewed

src/inference/manager.py (outdated, resolved)

ctheune reviewed

src/inference/manager.py (outdated, resolved)

elikoga added 12 commits

December 2, 2025 15:59


          add code review changes

d7b029b


          feat: dynamic port

45563ab


          move inference module

b3ae3f8


          feat: implement model unloading functionality and API endpoint

f8005ba


          integrate with skvaider

53ce3a1


          refactor: remove ModelManager call in skvaider and run inference in a server

db9bb68


          feat: update model configurations and enhance test assertions for embeddings

20a48d7


          feat: update inference server port handling and improve test endpoint URLs

0aaa63b


          feat: readd support for Ollama backend and parameterize model names in tests

3bde3be


          feat: add health check for backends during lifespan test


          change http error code, re-add ollama to lifespan

315d6da


          feat: refactor openai proxy file structure

31ee225

elikoga changed the title ~~Move to llama cpp inference~~ PL-134151 Move to llama cpp inference

elikoga added 10 commits

January 13, 2026 19:42


          feat: update ModelManager initialization to use models directory from environment variable

1e2abc6


          feat: rename load endpoint to get_running_model_or_load and update references

9953f5a


          feat: update download_model to use models directory from ModelManager

e9c3a46


          make /download endpoint input format more aligned with other api

7edb8a8


          unset default for context_size

c6cb0d8


          move health endpoint to /manager/health , fix tests

85d854e


          add filename to model configuration in test lifespan

0f4a6d5


          add proxy request endpoint to interact with models

fc1c515


          add filename to model configuration in test lifespan

074f161


          update backend configuration to use 'ollama' and adjust health check endpoints

1f2c95f

elikoga and others added 30 commits

January 29, 2026 15:11


          increase test_embeddinggemma_output_stability timeout for GH actions

5a7c1df


          increase test timeouts for github actions

b9db5e5


          Normalize model names to lowercase in inference endpoints and configuration

8b3987c


          update llama-cpp to remove mentions of gpt-3.5-turbo in the output

507d1b9


          Change embeddinggemma output stability test to validate embedding values with tolerance

1c11876
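
A tolerance-based comparison of embedding vectors could look roughly like the sketch below. The helper name `embeddings_close` and the tolerance values are assumptions for illustration and are not taken from the PR's test suite.

```python
import math


def embeddings_close(actual, expected, rel_tol=1e-3, abs_tol=1e-5):
    """Return True if both vectors have the same length and every
    component matches within the given tolerances.

    Hypothetical helper: the name and default tolerances are illustrative.
    """
    if len(actual) != len(expected):
        return False
    return all(
        math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, e in zip(actual, expected)
    )
```

Comparing with a tolerance rather than exact equality keeps such a test stable across hardware and library versions that produce slightly different floating-point results.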


          add monitoring for vram usage

7a41f9a

provide a CPU backend that checks regular memory usage for
development environments


          give CLAUDE.md a try

8a5c657


          extend memory management to allow inspecting real model usage

08664d6

track per host and per model usage for different backends (RAM, rocm)

introduce a task manager to unify starting and cleaning up ongoing
background tasks


          improve logging for memory usage info

e2160f2


          use new memory calculations for placing models on backends

9c3cbe5

- also some asyncio cleanups,
- more use of the task manager,
- a small fix to ensure model health is correctly updated on the gateway
when an inference server restarts
- improved logging


          inference: improve recovery from timeouts when loading models

we might have been leaking processes before. i couldn't quite
prove why this fixes it, but i'm not leaking processes on my
machine now. will have to check in a real environment.


          proxy: serialise loading models per backend.

5c6d5df

this gives a chance that we don't load multiple models in parallel
which then can cause overload.
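
One way to serialise loads per backend is a lock per backend name, as in the minimal sketch below. The names `load_locks`, `load_model_serialised`, and `do_load` are hypothetical; the PR's actual implementation is not shown on this page.

```python
import asyncio
from collections import defaultdict

# Hypothetical registry: one asyncio.Lock per backend name, created on first use.
load_locks = defaultdict(asyncio.Lock)


async def load_model_serialised(backend, model, do_load):
    """Run `do_load(model)` while holding the lock for `backend`.

    Loads on the same backend run one after another; loads on
    different backends can still proceed in parallel.
    """
    async with load_locks[backend]:
        return await do_load(model)


async def demo():
    """Start two loads on the same backend and report peak concurrency."""
    active = 0
    peak = 0

    async def fake_load(model):
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stand-in for the real loading work
        active -= 1
        return model

    await asyncio.gather(
        load_model_serialised("backend-a", "model-1", fake_load),
        load_model_serialised("backend-a", "model-2", fake_load),
    )
    return peak
```

In this sketch `asyncio.run(demo())` reports a peak of one concurrent load, because both requests target the same backend.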


          increase model loading timeout

11f6cfd


          asyncio: clean up task management and support unique/dedup tasks

4d51ec0

various model handling tasks are now unique and more easily
cleaned up. removes complexity of various approaches to deal
with tasks and their lifecycles
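
The idea of unique (deduplicated) tasks can be sketched as follows. `UniqueTasks` and its methods are hypothetical names chosen for illustration; they are not the task manager's real API.

```python
import asyncio


class UniqueTasks:
    """Keep at most one running task per key (hypothetical helper)."""

    def __init__(self):
        self._tasks = {}

    def create(self, key, coro_fn):
        """Return the running task for `key`, or start one from `coro_fn`."""
        existing = self._tasks.get(key)
        if existing is not None and not existing.done():
            return existing
        task = asyncio.create_task(coro_fn())
        self._tasks[key] = task
        task.add_done_callback(lambda finished: self._forget(key, finished))
        return task

    def _forget(self, key, task):
        # Only drop the entry if it still refers to the task that finished.
        if self._tasks.get(key) is task:
            del self._tasks[key]

    async def shutdown(self):
        """Cancel all remaining tasks and wait for them to finish."""
        tasks = list(self._tasks.values())
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


async def demo():
    calls = 0

    async def work():
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return "done"

    manager = UniqueTasks()
    first = manager.create("load:model-1", work)
    second = manager.create("load:model-1", work)  # deduplicated
    result = await first
    await asyncio.sleep(0)  # give done callbacks a chance to run
    return first is second, calls, result, len(manager._tasks)
```

Passing a coroutine function instead of a coroutine object avoids creating a coroutine that is never awaited when the request is deduplicated.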


          add info that this is a typed package to remove missing stub warnings

9255d13


          improve logging and fix a logging error

7ec49fb


          document the task manager a bit

a341177


          rename "warmup" to "reserved"

f6f9505


          rename "load_model_with_options" to "load_model"

d50e03c

This was a naming decision from earlier experiments. The "with options"
does not add value any longer.


          proxy: don't try loading a model if the fitness has dropped to 0

db8d7b7


          proxy: first implementation of automatically unloading models

1ebd205


          proxy: wrap up unloading models on demand and also add test coverage

40e2c57


          proxy: implement backend availability check and retry logic

6f0b7cf


          inference: ensure models can't be unloaded while being used

90e0e29
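
A guard of this kind can be built from a usage counter and an event, as in the sketch below. `ModelSlot`, `use`, and `unload` are hypothetical names, and the approach is an assumption about how such protection might work rather than a description of the PR's code.

```python
import asyncio
from contextlib import asynccontextmanager


class ModelSlot:
    """Track in-flight requests so unloading waits for them (hypothetical)."""

    def __init__(self):
        self.loaded = True
        self._in_use = 0
        self._idle = asyncio.Event()
        self._idle.set()

    @asynccontextmanager
    async def use(self):
        """Mark the model as in use for the duration of the block."""
        if not self.loaded:
            raise RuntimeError("model is not loaded")
        self._in_use += 1
        self._idle.clear()
        try:
            yield self
        finally:
            self._in_use -= 1
            if self._in_use == 0:
                self._idle.set()

    async def unload(self):
        """Wait until no request uses the model, then mark it unloaded."""
        while self._in_use:
            await self._idle.wait()
        self.loaded = False


async def demo():
    slot = ModelSlot()
    events = []

    async def request():
        async with slot.use():
            await asyncio.sleep(0.01)  # stand-in for inference work
            events.append("request finished")

    async def unload():
        await asyncio.sleep(0)  # let the request start first
        await slot.unload()
        events.append("unloaded")

    await asyncio.gather(request(), unload())
    return events
```

In `demo()` the unload only completes after the in-flight request has finished.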


          fix: handle case where task is already removed in cleanup callback

04b73cb


          fix: update proxy endpoint to return 540 status code for unavailable models and add corresponding tests

27768ed


          fix: ensure proper shutdown of fake llama server to avoid blocking

fcaee29


          prevent KeyError in cleanup callback by using pop with default

41936b5
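
The pattern named in this commit title is the standard `dict.pop(key, default)` idiom. A minimal illustration, with a hypothetical `running_tasks` registry standing in for whatever structure the PR uses:

```python
# Hypothetical registry; the PR's real data structure is not shown here.
running_tasks = {"load:model-1": "task-object"}


def cleanup(key):
    """Forget a finished task; safe to call more than once.

    `dict.pop(key, None)` returns None when the key is already gone,
    whereas `del running_tasks[key]` would raise KeyError.
    """
    return running_tasks.pop(key, None)
```

Calling `cleanup("load:model-1")` twice returns the stored value the first time and `None` the second time, without raising.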


          tests: add wait_for_models_active function to ensure model instances are active

0d0dc61


          Add metrics endpoint

1b00012

Labels

None yet