"Not enough space in the context's memory pool" exception in 1.52 #563

Closed
Vladonai opened this issue Dec 13, 2023 · 19 comments

@Vladonai

The new version of Koboldcpp (1.52) terminates with an error when trying to load the model "nethena-mlewd-xwin-23b.Q3_K_M.gguf":
[screenshot: "Not enough space in the context's memory pool" exception]

This was not the case with previous versions of the program. I have 64GB of RAM and 8GB of VRAM.

The model is from here: https://huggingface.co/TheBloke/Nethena-MLewd-Xwin-23B-GGUF/tree/main

@7erminalVelociraptor

Same error, different model (Goliath 120b). It worked fine until I updated; now I can't get it to load no matter what flags I set.

@VL4DST3R

Yeah, I get the same with multiple models I've tried. I figured it was the larger vram requirements noted in the changelog:

Partial per-layer KV offloading is now merged for CUDA. Important: this means that the number of layers you can offload to GPU might be reduced, as each layer now takes up more space. To avoid per-layer KV offloading, use the --usecublas lowvram option (equivalent to -nkvo in llama.cpp). Fully offloaded models should behave the same as before.

but reducing the offloaded layers even to extreme levels changes nothing.

@7erminalVelociraptor

7erminalVelociraptor commented Dec 14, 2023

Yeah, I get the same with multiple models I've tried. I figured it was the larger vram requirements noted in the changelog:

Partial per-layer KV offloading is now merged for CUDA. Important: this means that the number of layers you can offload to GPU might be reduced, as each layer now takes up more space. To avoid per-layer KV offloading, use the --usecublas lowvram option (equivalent to -nkvo in llama.cpp). Fully offloaded models should behave the same as before.

but reducing the offloaded layers even to extreme levels changes nothing.

Also tried running --cublas lowvram but still no dice; I even tried using --clblast instead of cublas and got the same error. Tried a lower context size, same thing. Tried using --nommap, but that just filled my RAM so completely that the desktop locked up and only a hard reboot was possible.

It's unlikely to be the environment, as it looks like the OP uses Windows and I'm on Arch Linux. If the entire release were completely broken, confused users would have been streaming in from the get-go, but so far that doesn't seem to be the case.

What's going on?

@LostRuins
Owner

I can repro this and have found a solution, but it's a bit strange why it happens. I'll also open an issue upstream, as they'll likely have the same problem.

@LostRuins
Owner

I think I fixed it, but it would be good if someone could verify with ggerganov#4461.

@VL4DST3R

@LostRuins From reading your other thread, this seems to be caused by the "partial per-layer KV offloading" in the latest version, but I have to ask: what exactly does it do/mean? I tried looking it up on the wiki here and more broadly online but came up blank. Also, is it preferable despite the increase in VRAM usage?

@LostRuins
Owner

Actually this issue is not related to the partial KV offloading at all. It was caused by a numerical precision overflow, where a float cannot fully represent a very big number accurately.
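To illustrate the kind of precision loss being described (this is just a sketch, not the actual koboldcpp/llama.cpp code): a 32-bit float has only about 24 bits of mantissa, so a large buffer size computed through a float can silently round down to less than what is actually needed.

```python
# Illustration only: how float32 rounding can understate a large byte count.
# The specific numbers are made up; only the rounding behavior is the point.
import numpy as np

required_bytes = 1_610_612_789                 # some large allocation size
via_float32 = int(np.float32(required_bytes))  # float32 keeps ~24 bits of mantissa

print(required_bytes, via_float32)   # 1610612789 vs 1610612736
print(via_float32 < required_bytes)  # True: a pool sized this way comes up short
```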

Previously, KV is only offloaded at the end after all other layers have been offloaded. Partial KV offloading is being able to offload a portion of the KV progressively alongside the other layers, which usually results in a faster speed if you're able to offload like 3/4 of all layers.
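A rough way to picture the difference (a toy model with made-up sizes, not the real allocator): previously the GPU budget was spent on whole layers first, with KV handled at the end, whereas with partial per-layer KV offloading each offloaded layer also brings its slice of the KV cache, so each layer costs more VRAM and fewer layers fit.

```python
# Toy model of why fewer layers fit on the GPU with per-layer KV offloading.
# Sizes are arbitrary units, not real measurements.
LAYER_BYTES = 300    # weights per layer
KV_PER_LAYER = 100   # KV-cache slice per layer
VRAM_BUDGET = 2400

def layers_that_fit(per_layer_cost: int, budget: int = VRAM_BUDGET) -> int:
    """Number of whole layers that fit in the VRAM budget."""
    return budget // per_layer_cost

old_behavior = layers_that_fit(LAYER_BYTES)                 # KV stays on the CPU
new_behavior = layers_that_fit(LAYER_BYTES + KV_PER_LAYER)  # each layer carries its KV

print(old_behavior, new_behavior)  # 8 vs 6 layers in this toy example
```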

I personally don't find it that useful, and usually disable it for myself. But some people seem to like it.

@VL4DST3R

I personally don't find it that useful, and usually disable it for myself. But some people seem to like it.

So will you use --usecublas lowvram yourself from now on, then? Shouldn't it be an optional flag rather than on by default?

Also, is there any objective way to benchmark this? I'm guessing... seeing how a model runs with the same amount of GPU memory used?

@LostRuins
Owner

You are right, I am setting lowvram to be disabled by default in the next version.

Benchmark by measuring time taken with a bunch of prompts. I have done so before - see ggerganov#4309
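For anyone who wants to run such a comparison themselves, a minimal timing sketch along these lines should do, assuming a KoboldCpp instance listening on the default port with the KoboldAI-compatible /api/v1/generate endpoint (the URL, payload fields, and prompts here are assumptions for illustration, not taken from this thread):

```python
# Rough benchmark sketch: total wall-clock time for a handful of prompts against
# a locally running KoboldCpp server. Adjust the URL and payload for your setup.
import time
import requests

URL = "http://localhost:5001/api/v1/generate"
PROMPTS = [
    "Write a short story about a lighthouse keeper.",
    "Explain how attention works in a transformer.",
    "List five unusual uses for a paperclip.",
]

start = time.perf_counter()
for prompt in PROMPTS:
    response = requests.post(URL, json={"prompt": prompt, "max_length": 200})
    response.raise_for_status()
elapsed = time.perf_counter() - start

print(f"{elapsed:.1f}s total, {elapsed / len(PROMPTS):.1f}s per prompt")
```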

Some people claim it's better for them - it is a trade-off, as I would rather have slower processing but faster generation.

@VL4DST3R

VL4DST3R commented Dec 14, 2023

You are right, I am setting lowvram to be disabled by default in the next version.

Isn't that already the case? And no, I meant it as "is using this flag how you personally plan to counteract this change from now on in your own personal gens?", since you said you prefer the old behavior of faster generation over faster processing (just like me).

I was trying to figure out what I need to change in my settings so that this doesn't affect me negatively.

@LostRuins
Owner

I was referring to the GUI defaults when running in GUI mode. I will leave the lowvram checkbox unchecked by default in the next version. If you're using the command line, then it will be whatever you set it to.

You don't have to change anything. All existing configs and command line args will work.

@VL4DST3R

VL4DST3R commented Dec 14, 2023

Got it, but my point was that if one wants to keep the slower processing and faster generation (the pre-1.52 behavior when not fully offloading to VRAM), the only way currently seems to be using lowvram, and I know this also leaves out scratch buffers, which I imagine results in a speed decrease. Is there a way to just not use the partial offloading specifically, leaving everything else as-is?

@LostRuins
Owner

LostRuins commented Dec 14, 2023

Ah no, the scratch buffer thing was for pre-gguf model behavior. There is no difference anymore.

Basically:

  • For full offload, do not enable lowvram
  • If you enable lowvram and do not fully offload, behavior is identical to v1.51 with no lowvram, because lowvram did not affect GGUF models prior to this version and has in fact already been removed upstream. (A hypothetical launch sketch for both setups follows below.)
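As a concrete, purely hypothetical example of those two configurations, using the flags mentioned above plus the usual --model and --gpulayers options (the model path and layer counts are placeholders; check --help on your build for the exact spellings):

```python
# Hypothetical launcher sketch for the two setups described above.
# Model path and layer counts are placeholders; flags follow the changelog wording.
import subprocess

MODEL = "nethena-mlewd-xwin-23b.Q3_K_M.gguf"

# Full offload: leave lowvram off so per-layer KV offloading stays in effect.
full_offload = ["python", "koboldcpp.py", "--model", MODEL,
                "--usecublas", "--gpulayers", "99"]

# Partial offload with v1.51-style behavior: pass lowvram to --usecublas so the
# whole KV cache stays off the GPU (equivalent to -nkvo in llama.cpp).
partial_offload = ["python", "koboldcpp.py", "--model", MODEL,
                   "--usecublas", "lowvram", "--gpulayers", "20"]

subprocess.run(full_offload)  # or subprocess.run(partial_offload)
```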

@Vladonai
Author

Some people claim it's better for them - it is a trade-off, as I would rather have slower processing but faster generation.

I don't understand why this is even necessary. The task of initial processing of a large context is much, MUCH better solved by preserving the model context on exit. Everything else is a disadvantage.

Tested the new version (1.52.1) with the model from the initial post. Now it loads and works.

@VL4DST3R

Partial KV offloading is being able to offload a portion of the KV progressively alongside the other layers, which usually results in a faster speed if you're able to offload like 3/4 of all layers.

So even with the reduced number of layers due to the size increase, it should still generate faster if over 3/4 of all layers are offloaded? How? I thought it only affected processing and nothing else?

Ah no, the scratch buffer thing was for pre-gguf model behavior. There is no difference anymore.

But if one were to still use pre-gguf models (e.g. airoboros 33b), it would still be detrimental, no? 👀

I know I'm biased here since I value actual gen speed over processing, but wouldn't it be easier to have it (partial offloading) as an optional flag for people who specifically want to use it? You mentioned a very good point in the other thread, namely:

Overall, trading a ~40% increase PP speed for a 22% reduced generation speed does not really feel like a worthwhile trade to me, especially if we take into account the reduced need for rapid prompt processing while using context shifting. I think perhaps you were testing at -c 512 which has a much smaller difference in number of layers offloaded. At 4096 ctx, the difference will probably be even greater.

which leads me to believe this will most likely be detrimental for most kcpp users, since I imagine almost everyone makes use of smart context/context shifting to reduce subsequent processing.
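To put rough numbers on that trade-off: only the +40% prompt-processing and -22% generation deltas come from the quote above; the baseline speeds, prompt length, and output length below are invented purely for illustration.

```python
# Back-of-the-envelope comparison of total response time. Baselines are made up;
# only the +40% / -22% deltas come from the discussion above.
PROMPT_TOKENS = 4096   # context to process
GEN_TOKENS = 250       # tokens to generate

base_pp_speed = 500.0  # tokens/s prompt processing (assumed)
base_gen_speed = 9.0   # tokens/s generation (assumed)

def total_seconds(pp_speed: float, gen_speed: float) -> float:
    return PROMPT_TOKENS / pp_speed + GEN_TOKENS / gen_speed

old = total_seconds(base_pp_speed, base_gen_speed)                # v1.51-style
new = total_seconds(base_pp_speed * 1.40, base_gen_speed * 0.78)  # +40% PP, -22% gen

print(f"old: {old:.1f}s, new: {new:.1f}s")
# With context shifting, the full prompt is rarely reprocessed, so the slower
# generation speed dominates in practice.
```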

@LostRuins
Owner

Yeah, if you are using pre-gguf models, then you should toggle lowvram off/on as needed.

Partial offloading is an optional flag; it's disabled with --lowvram (and that's all that lowvram does for GGUF models: it disables all KV offloading). I previously tried making lowvram the default, but then others objected too.

@VL4DST3R

Partial offloading is an optional flag; it's disabled with --lowvram

It is optional in the sense that it can be toggled, yes, but since it's not the default state, doesn't that make the lowvram flag the optional one in this context instead?

I previously tried making lowvram the default, but then others objected too.

I understand, it's hard to please everyone. Given that the issue isn't really an issue per se, just a preference, I guess ultimately you should have the final say in how this behavior is set by default.

@Vladonai
Author

Vladonai commented Dec 15, 2023

I understand, it's hard to please everyone. Given that the issue isn't really an issue per se, just a preference, I guess ultimately you should have the final say in how this behavior is set by default.

Am I understanding correctly that if I hadn't read this thread, I would have suddenly lost ~20% of generation speed after upgrading to 1.52? :)

@VL4DST3R

VL4DST3R commented Dec 15, 2023

Not exactly. You would now OOM a lot faster, and fixing that would cost you some generation speed. That's why I figured it should be pointed out as an issue, but oh well.
