Support for ExLlama library #281
Comments
To start, I opened a PR on exllama to support applying a bias to the logits during token generation: turboderp/exllama#104
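For illustration, here's a minimal sketch of what "applying a bias to the logits" means in this context. It's plain PyTorch, and the function and parameter names are placeholders rather than the actual API proposed in that PR:

```python
import torch

def sample_with_logit_bias(logits: torch.Tensor, logit_bias: dict) -> int:
    """Add a per-token bias to next-token logits before picking a token.

    logits: 1-D tensor of vocabulary logits for the next position.
    logit_bias: maps token id -> additive bias, e.g. float('-inf') to forbid
    a token, or a large positive value to strongly favor it.
    """
    biased = logits.clone()
    for token_id, bias in logit_bias.items():
        biased[token_id] += bias
    # Greedy selection for brevity; a real sampler would apply temperature,
    # top-k/top-p, etc. to the biased logits instead.
    return int(torch.argmax(biased).item())
```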
I similarly would love exllama support, as it's currently the fastest and most memory-efficient way to run these models that I'm aware of. On my 3080 12GB, a 13B model (4-bit GPTQ) fits comfortably when loaded with exllama even at long context sizes, while the normal GPTQ/AutoGPTQ loaders will run out of memory if my context gets too long. I played around a bit with the exllama wrapper in oobabooga/text-generation-webui, hacked on it until guidance accepted it, and got it to work - but only if I disable caching, which unfortunately is a pretty big compromise :( The ExLlamaCache is just too different from the normal HF cache, and guidance seems to expect the HF cache in certain areas (like checking the current context length). But someone with more experience and patience than I have could probably get it to work.
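To make the cache mismatch concrete, here is a rough sketch of the two ways of asking "how many tokens are already cached". The HF side is the standard past_key_values convention; the ExLlama attribute name is from memory, so treat it as an assumption:

```python
def cached_length(past_key_values=None, exllama_cache=None) -> int:
    """Hypothetical helper showing why a drop-in swap is hard."""
    if past_key_values is not None:
        # HF convention: a tuple of (key, value) tensors per layer, where
        # key.shape == (batch, n_heads, cached_seq_len, head_dim), so the
        # cached length can be read straight off the tensor shape.
        return past_key_values[0][0].shape[2]
    if exllama_cache is not None:
        # ExLlamaCache keeps its keys/values in pre-allocated buffers and
        # tracks the length itself (attribute name assumed from exllama).
        return exllama_cache.current_seq_len
    return 0
```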
@andysalerno: Is this your WIP code that someone could continue working on?
@Glavin001 it is, but I gave up on it since the lower-level stuff went over my head. I was expecting to just write a little shim class that wraps exllama and presents it as a normal HF transformers class, but the internals are so different that it ended up being too difficult for me. Using the code you linked, I was able to get token generation working with both Guidance and exllama - but without caching, which decimates performance. And now that I know a bit more, there are probably other issues I didn't account for, so I wouldn't use it. So far this exllama wrapper seems to be the closest path: https://github.com/oobabooga/text-generation-webui/blob/c95009d2bd55b5b332148a53a445f3c619a004a5/modules/exllama_hf.py#L27C1-L27C34 But it doesn't provide everything Guidance needs to work.
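Roughly, that "shim" idea looks like the skeleton below: wrap ExLlama behind the transformers causal-LM interface that guidance drives. The transformers classes are real, but the ExLlama-side calls are placeholders; the actual ones are in the exllama_hf.py linked above.

```python
import torch
from transformers import PretrainedConfig, PreTrainedModel
from transformers.modeling_outputs import CausalLMOutputWithPast

class ExLlamaHF(PreTrainedModel):
    """Sketch of an HF-compatible wrapper around an exllama model."""

    def __init__(self, config: PretrainedConfig, ex_model, ex_cache):
        super().__init__(config)
        self.ex_model = ex_model  # underlying exllama model (placeholder)
        self.ex_cache = ex_cache  # ExLlamaCache instance (placeholder)

    def forward(self, input_ids: torch.LongTensor, past_key_values=None, **kwargs):
        # Placeholder call: feed the new tokens and let exllama's own cache
        # carry the rest of the context.
        logits = self.ex_model.forward(input_ids, self.ex_cache)
        # Guidance also pokes at past_key_values (e.g. to find the current
        # context length), which is exactly where ExLlamaCache doesn't line up.
        return CausalLMOutputWithPast(logits=logits, past_key_values=past_key_values)
```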
Hi, thanks for the guidance project, I really like it!
I've been wondering for some time about the possibility of extending guidance to support more open-source projects, such as ExLlama (https://github.com/turboderp/exllama) and RWKV models (https://github.com/BlinkDL/RWKV-LM).
Is there an interest from the maintainers in adding this support?
I've looked a bit into guidance's source code, and it seems the key feature for getting this sort of integration working is a way to add a bias to the models' next-token predictions (logits). I'm fairly sure it's possible to do this with these libraries (though it might require a pull request in their repositories too).
For instance, ExLlama's sample function already supports receiving logits as a parameter, and even adding a negative bias to constrained tokens: https://github.com/turboderp/exllama/blob/a01b25c884881871a0f75c96bbc582b6581665cb/generator.py#L88
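As a concrete example of that mechanism (illustrative code, not exllama's actual implementation): constraining generation to a set of allowed tokens is just a large negative bias on everything else.

```python
import torch

def constrain_logits(logits: torch.Tensor, allowed_token_ids: list) -> torch.Tensor:
    """Push every token outside the allowed set to -inf, so sampling can only
    pick from allowed_token_ids. This is the per-step effect guidance needs."""
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_token_ids] = 0.0
    return logits + mask
```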
What do you think?
I'd also be happy to give this a shot, though it may be a bit hard for me since I'm not a core maintainer of any of these repositories.
Thanks!