Smart cache (RAM context cache) #1851
base: concedo_experimental
Conversation
"Hi @LostRuins, I've opened this draft PR as a functional proof-of-concept for the Smart Cache feature. As we discussed, I'd really appreciate your feedback and any help you can offer to refine it before I mark it as ready for a formal review. Thank you!"
@LostRuins this PR is ready to be reviewed. While implementing the Smart Cache feature, I noticed the guideline "//No dynamic memory allocation! Setup structs with FIXED (known) shapes and sizes". My implementation has a few deviations from these strict rules. Current choices:
These choices favor:
vs strict compliance which would require:
Question: Are these pragmatic deviations acceptable, or would you prefer strict compliance?
Thanks @wbruna! All three points addressed:
Changes pushed. Ready for re-review!
If in the future this ends up being a half-useful feature rather than a fully useful one, due to the potential of needing a LOT more sysRAM, I suppose one option is to specify a maximum context size the user may send over before it gets either truncated before generating the ctx KV database, or skipped entirely, with the option of reporting this back to the user? My phrasing isn't exactly the most eloquent, but this is what I can muster for linguistics at this time. The assertion of half/full usefulness isn't a dig at you, but rather about what the user would find a barrier to entry/usage.

In any case, with all the newer storage methods and RAM speeds around lately, the need to actually copy it back to VRAM might be almost redundant, depending on the latency introduced by the process of loading it into VRAM. System RAM is blazing fast, and I can attest that in many situations I have preferred fully manually offloading the KV cache to sysRAM to load all the layers on the GPU, because the non-fragmentation is much more effective. In my case it's a bit slow, albeit still faster than fragmentation across a GTX 1070 & i7-7700K. But that just stands to prove that there is most definitely room for "play" and nonstandard ways of holding onto that KV cache.

As to how the faster storage I mentioned comes into play: some NVMe storage devices come scarily close to the total roundtrip time you'd find for RAM access, given the possibly easier ability to access NVMe directly over the bus. That's in essence no different from the concept of directly attaching your storage to your PCIe bus for that gaming use case; it makes use of the exact justifications and reasons I'm bringing up now. That said, I'm not knowledgeable enough to have a firm grasp on what sort of latency figures this all works out to; I'm merely keeping in mind that there are those extra layers to traverse, in the end.
That's especially true for Python, even if CPython is really impressive these days. I'm following this; I'm curious what we end up with for Christmas this year!
Consumer gaming PCs have 32 GB of RAM, sometimes 64 GB (like I do). If you don't use mmap to load the model, most of it is free, unless you use MoE experts on CPU or stuff like that. It's with huge context that this feature truly shines. I'm testing this feature using a 48 GB smart cache, with an AIHorde worker (around 9% cache hit rate, 36h), waidrin (https://github.com/p-e-w/waidrin) and SillyTavern. I can really notice the difference.

About the "speed" of moving data RAM <-> VRAM: even with DDR3 RAM you would still get better speeds than having to preprocess thousands of tokens in case of a cache hit.

About storage hierarchy and NVMe speeds: you're absolutely right, modern NVMe (especially Gen4/5) has dramatically narrowed the gap to DRAM latency. The challenge with the KV cache specifically is the frequency of access during generation (every token decode touches it), so even 50-100µs NVMe latency vs ~100ns RAM adds up fast. That said, for hibernated slots that aren't actively generating, NVMe could be brilliant: a kind of tiered cache (VRAM → RAM → NVMe). The current implementation keeps everything in RAM because we're using llama.cpp's save_state_kv(), which serializes to a memory buffer. Extending this to NVMe would need a custom serialization path that bypasses the buffer, but it's technically doable.

About direct VRAM offload vs fragmentation: your GTX 1070 experience is a perfect example; sometimes predictable RAM latency beats fragmented VRAM/split execution. The smart cache sits in a sweet spot for that: it keeps the working set in VRAM while letting you maintain a much larger "recently used" pool in RAM without OOM-ing the GPU.
Hi @Pento95, thanks for your PR. Sorry I didn't get to it earlier, been busy. I took a look at this, and that's a massive amount of code you're adding 😅. I do like the idea, but (and no offense) I don't really like this implementation, especially the changes to koboldcpp.py. Ideally koboldcpp.py should not be managing cached context contents at all; that's so far always been transparently handled by the backend (like how context shifting or fast forward is completely transparent to the Python side, which simply sends prompts). I think the segment about automatic LRU eviction based on RAM, and variable slot counts, is also not necessary, mainly due to the high complexity. We actually already have the functionality we can hook into for loading and saving states, so the whole side slot system is a little redundant; the goal is just to automate it (it's currently triggered manually).

Okay... so let's take a step back. Your core idea is solid. Goal: allow automatic save and load of KV states to reduce reprocessing in cases where we juggle between a small set of repeated prompts (like AI Horde usage). It doesn't actually have much to do with VRAM; we're just reusing the KV state from a previous prompt. Suggested simple approach:
This should be way simpler and only requires adding 3 new functions, out of which 2 are just helper functions for existing code. It does not even need to touch koboldcpp.py except for adding one flag+checkbox that enables/disables the smart cache. And it should work as you intended. Thoughts?
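The save/restore flow suggested above could be sketched roughly like this. This is a hypothetical illustration, not code from the PR: the names `SavedState`, `load_best_state`, and `save_state` are made up, and a real implementation would hold the serialized KV buffer produced by the existing state-saving functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical saved state: the prompt tokens it was built from, plus
// (in a real implementation) the serialized KV buffer.
struct SavedState {
    std::vector<int> tokens;
};

static std::vector<SavedState> g_states;  // the small pool of saved slots

// Length of the common token prefix between two prompts.
static size_t common_prefix(const std::vector<int> &a, const std::vector<int> &b) {
    size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) ++n;
    return n;
}

// Before generating: pick the saved state sharing the longest prefix with
// the new prompt. Returns the slot index, or -1 for a cache miss.
static int load_best_state(const std::vector<int> &prompt) {
    int best = -1;
    size_t best_len = 0;
    for (size_t i = 0; i < g_states.size(); ++i) {
        size_t len = common_prefix(g_states[i].tokens, prompt);
        if (len > best_len) { best_len = len; best = (int)i; }
    }
    return best;
}

// Before switching to an unrelated prompt: save the current state.
static void save_state(const std::vector<int> &tokens) {
    g_states.push_back({tokens});
}
```

On a hit, the restored state lets the existing fast-forward path skip the shared prefix instead of reprocessing it.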
Just my $0.02 on a specific point:
At least customizing the number of smart cache slots could be allowed without additional complexity: the algorithm would work pretty much the same, and it could still be controlled by a single config (<=0 to disable, >=1 for the number of available slots).
Sure, that can work
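wbruna's fixed-slot-count configuration could look roughly like this sketch (the type and field names are illustrative, not from the PR): a single integer controls everything, with <=0 disabling the cache and >=1 bounding the number of stored states, evicting the oldest when full.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Illustrative sketch: one config value controls the cache.
// <= 0 disables it entirely; >= 1 bounds the number of stored states.
struct SlotCache {
    int max_slots = 0;
    std::deque<std::vector<int>> slots;  // oldest state at the front

    bool enabled() const { return max_slots > 0; }

    void store(const std::vector<int> &tokens) {
        if (!enabled()) return;
        if ((int)slots.size() >= max_slots) slots.pop_front();  // evict oldest
        slots.push_back(tokens);
    }
};
```

The appeal is the bounded footprint: N slots means at most N saved states in RAM, regardless of what else is running on the machine.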
I'll work on it right away! Except for smartcacherammaxsize: I would like to keep that, since it makes the feature more user-friendly and it works. I'll move it to C++ too.
The problem with using the RAM size thing is it makes a lot of assumptions about the system, which I am not sure is a good thing. Besides your LLM software (which takes up a very variable amount of memory, depending on your GPU setup and launch options), there are also interactions with other things running on the system (which can themselves vary in RAM usage too, for example if you run 2 koboldcpp instances, or one plus some other AI image gen tool, etc.). I'd much prefer wbruna's approach of specifying a save state count, which will have a known max memory footprint you can easily tweak. Also, getting RAM cross-platform is... not reliably easy. Kcpp can run on Linux x64, Linux ARM, macOS, Termux Android, Windows... there will be unnecessary trouble getting it working reliably on them all.
@LostRuins thanks for the feedback. I see your point regarding the reliability of querying system RAM across different platforms (Termux, Docker, etc.). I agree that get_system_ram() introduces unnecessary complexity and potential instability. Proposal: I will remove the OS-level RAM validation (detection included) logic entirely. However, I strongly advocate keeping the logic based on Size (GB) rather than Slots (Count), because "Slots" are extremely hard for users to estimate. The size of a context state varies wildly depending on:
Asking users to calculate "how many slots fit in my RAM" given these variables is bad UX. The Plan:
This removes the "unreliable cross-platform code" you are concerned about, while keeping the feature intuitive. If the user sets a limit higher than their actual physical RAM, the OS will simply handle it (swap/OOM) as it does with any other application, just as it would with the slots approach. If you agree, I'll push these changes (removing the validation) ASAP. EDIT: I already removed the "get RAM" function.
To me, slots are more intuitive than size, because the user doesn't know how much space one KV state takes. If I want to allocate enough space for 3 of my friends to use Kobold together, how much space do I need for our 4 slots? 10GB? 20GB? 500GB? It would be easier to just let them say "4 slots" and then allocate 4 slots for the 4 people to use. Slots is just "you can keep this number of unique states before something gets removed". It's simple and understandable.
@LostRuins I agree with your vision: for scenarios with a fixed number of actors (e.g., hosting 3 friends), defining Slots is definitely the most intuitive approach. However, I propose supporting both limits (defaulting to 0/Disabled) because there are distinct valid use cases for each:
Proposed Implementation: Logic: the feature activates if either limit is > 0. Eviction happens if: (slots > 0 && current_slots > slots) || (ram_limit > 0 && current_size > ram_limit). This enables the "simple 3 friends" setup you want, while also safely supporting the dynamic/heavy scenarios without complex OS-level RAM detection. However, if you feel this dual approach adds too much complexity and prefer to strictly stick to a "slots only" implementation, just let me know and I will proceed to remove the RAM limit logic entirely.
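The dual-limit eviction rule described above, written out as a small predicate (names are illustrative; the PR's actual flag handling may differ):

```cpp
#include <cassert>
#include <cstddef>

// Illustrative names; both limits default to 0 (disabled).
struct CacheLimits {
    int max_slots = 0;               // slot-count limit, 0 = disabled
    std::size_t max_ram_bytes = 0;   // RAM-size limit, 0 = disabled
};

// The feature is on if either limit is set.
static bool feature_active(const CacheLimits &l) {
    return l.max_slots > 0 || l.max_ram_bytes > 0;
}

// Evict when any enabled limit is exceeded.
static bool should_evict(const CacheLimits &l, int cur_slots, std::size_t cur_bytes) {
    return (l.max_slots > 0 && cur_slots > l.max_slots)
        || (l.max_ram_bytes > 0 && cur_bytes > l.max_ram_bytes);
}
```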
Perhaps the fact you even question this is an indicator it might be best to keep them separate but interworking, and have two PRs instead? That'd let you keep the idea and changes presented while not hampering the other half of your ideas. A perhaps sometimes necessary reminder: you're not restricted in how many PRs you can make. You don't need to cram everything into one. :)
Has there been any consideration regarding UI implementation for any of this? Both the desktop/terminal UI as well as the Lite webui. Don't forget the config file bits either.
I think let's keep it simple and do slots only for now. We can always do a RAM size -> slot count converter in the future.
On it! |
wbruna left a comment:
I don't know how much will still be changed, so just a few comments for now.
    }
    g_smart_cache_manager = new SmartCache::SmartCacheManager(max_ram_gb);
    g_smart_cache_metrics.reset();
    return g_smart_cache_manager;
I'm a little confused about the responsibility of the 'expose' module here: I'm more used to the image generation side, and there it's just a layer to interface with the Python code. But here it's... also exposing global vars between C++ modules?
Anyway: I don't think it's correct to return this here as a void* in any case. I don't think there is a need to get this pointer back to the Python level: from Python's POV, it's simply an internal detail of the C++ code, and it could be controlled simply by configuration:
- set >0 slots: either create the manager, or reconfigure it to the new number
- set ==0 slots: release the manager
I'd go further and add (not sure if @LostRuins would agree btw 🙂):
- set <0 slots: set the "default" amount (fixed for now, maybe dynamically from available RAM in the future)
OTOH, if this pointer needs to be accessed by C++, it should be a full object pointer.
The expose module is a dumb layer just used to shape and shuttle data inputs and outputs to feed into other functions. You should NOT be adding any implementation logic there.
Anyway, @wbruna is right; I don't like over-engineering stuff. Like I said earlier, this whole feature can be done with 2 simple helper functions and a slight modification to the generate code. There's no need to go about creating factories or managers or other complex abstractions. Let's keep it simple please.
    #include "smart_cache.h"

    // Global Smart Cache instance (initialized by Python)
    SmartCache::SmartCacheManager* g_smart_cache_manager = nullptr;
Like before in the vector case: does having a pointer here help at all? The memory cost of a single C++ object in itself is minimal: I'd keep it around directly as a global, and just reconfigure it.
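A minimal sketch of what wbruna is suggesting, with an illustrative type name: the object lives as a plain global value and is reconfigured in place, so there is no heap allocation, no nullptr checks, and no delete/new cycle.

```cpp
#include <cassert>

// Illustrative sketch: the cache object as a plain value global that is
// reconfigured in place, instead of a heap-allocated pointer.
struct SmartCacheConfig {
    int max_slots = 0;  // 0 = disabled
    void reconfigure(int slots) { max_slots = slots; }
    bool enabled() const { return max_slots > 0; }
};

static SmartCacheConfig g_smart_cache;  // no new/delete, no nullptr checks
```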
        g_smart_cache_metrics.record_save_to_ram();
    }

    // Get statistics (returns JSON string - Python must free it)
If it's a pointer to a static std::string (or its .c_str() representation), Python can't free it.
If Python must free it, you need to return a brand-new heap-allocated string.
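A common way to honor that contract, sketched here with hypothetical function names (not the PR's actual API): the C side returns a fresh heap-allocated copy and exposes a matching free function, so the memory is released through the same allocator that produced it.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <string>

// Return a freshly heap-allocated copy of the JSON; ownership passes to
// the caller, who must release it via smart_cache_free_string.
extern "C" char *smart_cache_stats_json() {
    std::string json = "{\"hits\":0,\"misses\":0}";  // built fresh per call
    char *out = static_cast<char *>(std::malloc(json.size() + 1));
    std::memcpy(out, json.c_str(), json.size() + 1);
    return out;
}

extern "C" void smart_cache_free_string(char *p) {
    std::free(p);
}
```

On the ctypes side, the returned pointer would be passed back to `smart_cache_free_string` rather than freed by Python directly, since Python's allocator and the C runtime's may differ.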
That said, I'd like to add that I don't think anyone's critiquing your *effort*. Any input is welcome input. It gets the brain energy flowing. 🫡
I really appreciate the guidance and help. I'm a web developer; Python and C++ are not my jam, so I'm overengineering stuff a lot. Skill issue. Probably, once I've done all the tests (at the moment I'm experiencing a CUDA illegal memory access, which I'm debugging), I'll delete the "stats" endpoint, which adds a lot of code (especially on the Python side) and a few exposed variables. Thanks again for the huge help and feedback; I'll do my best to simplify the implementation as much as I can! @HumbleDeer UI implementation has been done: there is a checkbox to enable it and a numeric slider to set the size (the pushed version is still RAM; soon it'll be max slots)! Feedback is most welcome.
Closes #1827
The Problem
As described in issue #1827, frequent context switching (e.g., in multi-user scenarios like AI Horde) causes significant latency. This occurs because the KV cache in VRAM must be discarded and re-calculated from scratch for each new, unrelated prompt, wasting processing time.
The Solution
A Multi-Slot RAM KV Cache system uses system RAM to save and restore KV cache snapshots, inspired by llama.cpp's server_prompt_cache. On a cache hit, the saved state is restored and ContextFastForward skips the already-processed tokens. This approach drastically reduces latency during context switches, improving efficiency and response speed in multi-user scenarios.
Architecture: Two-Level Cache System
Key Features
--smartcache, --smartcacherammaxsize, and --smartcachethreshold flags for command-line configuration. /api/extra/stats/smartcache endpoint to monitor cache performance and statistics (hit rate, misses, etc.).
How to use
Smart Cache Two-Level System Commands:
--smartcache Enable smart cache two-level system for intelligent context switching (default: disabled).
--smartcacherammaxsize [GB]
Maximum RAM size in GB for smart cache slots (default: 10GB). Smart cache will create unlimited slots until this RAM limit is reached. Cannot exceed 90% of total system RAM.
--smartcachethreshold [threshold]
Similarity threshold (0.0-1.0) for cache reuse. Values >= threshold use ContextFastForward, values < threshold trigger context switch with RAM search. (default: 0.8)
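One plausible reading of this similarity check (the PR's exact formula isn't shown here, so this sketch is an assumption): the fraction of a cached prompt's tokens matched by the incoming prompt's prefix, compared against --smartcachethreshold.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical metric: the share of the cached prompt's tokens that match
// the incoming prompt's prefix, in [0.0, 1.0].
static double prompt_similarity(const std::vector<int> &cached,
                                const std::vector<int> &incoming) {
    if (cached.empty()) return 0.0;
    std::size_t n = 0;
    while (n < cached.size() && n < incoming.size() && cached[n] == incoming[n]) ++n;
    return static_cast<double>(n) / static_cast<double>(cached.size());
}
```

With the default threshold of 0.8, a cached 10k-token prompt would be reused via ContextFastForward only if at least 8k of its leading tokens match the new request.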