Support for ExLlama library #281

Open
paolorechia opened this issue Jun 26, 2023 · 4 comments

@paolorechia

Hi, thanks for the guidance project, I really like it!

I've been wondering for some time about the possibility of extending guidance to support more open-source projects, such as ExLlama (https://github.com/turboderp/exllama) and RWKV models (https://github.com/BlinkDL/RWKV-LM).

Are the maintainers interested in adding this support?

I've looked a bit into guidance's source code, and it seems the key feature needed for this sort of integration is a way to add a bias to the models' predicted logits. I'm fairly sure this is possible with these libraries (though it might require pull requests in their repositories too).

For instance, ExLlama's sample function already supports receiving logits as a parameter and can apply a negative bias to constrained tokens: https://github.com/turboderp/exllama/blob/a01b25c884881871a0f75c96bbc582b6581665cb/generator.py#L88
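
To illustrate what I mean, here's a rough, library-agnostic sketch of the mechanism (the function and variable names are just illustrative, not guidance's or exllama's actual APIs):

```python
# Minimal sketch of logit biasing for constrained decoding: push every token
# that the grammar does not currently allow to -inf, so sampling can only pick
# allowed tokens. Names here are illustrative assumptions, not real APIs.
import torch

def bias_logits(logits: torch.Tensor, allowed_token_ids: list) -> torch.Tensor:
    """Return logits (shape [vocab_size]) with disallowed tokens masked out."""
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_token_ids] = 0.0
    return logits + mask

# Usage: sample only from the allowed subset of the vocabulary.
# probs = torch.softmax(bias_logits(logits, allowed_token_ids), dim=-1)
# next_token_id = torch.multinomial(probs, num_samples=1)
```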

What do you think?

I'd also be happy to give this a shot, though it may be a bit hard for me since I'm not a core maintainer of any of these repositories.

Thanks!

@paolorechia

To start, I opened a PR on exllama to support applying a bias to the logits in the generation of a token: turboderp/exllama#104

@andysalerno

I would similarly love exllama support, as it's currently the fastest and most memory-efficient way to run models that I'm aware of. On my 3080 12GB, a 13B model (4-bit GPTQ) fits comfortably when loaded with exllama even at long context sizes, while the normal GPTQ/AutoGPTQ loaders run out of memory if my context gets too long.

I played around a bit with the exllama wrapper in oobabooga/text-generation-webui and hacked on it until guidance accepted it, and I got it to work - but only with caching disabled, which unfortunately is a pretty big compromise :( The ExLlamaCache is just too different from the normal HF cache, and guidance expects the HF cache in certain places (like checking the current context length). But someone with more experience and patience than I have could probably get it working.
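
To make the mismatch concrete, here's roughly the difference (the attribute and function names below are my assumptions, not the actual guidance internals - verify against the versions you use):

```python
# Sketch of the cache mismatch: guidance-style code reads the current context
# length from HF-style past_key_values tensor shapes, while ExLlama tracks it
# as a counter on its own cache object.

def context_len_from_hf_past(past_key_values):
    # HF caches are tuples of (key, value) tensors shaped
    # [batch, n_heads, seq_len, head_dim], so the length lives in dim 2.
    return past_key_values[0][0].shape[2]

def context_len_from_exllama(cache):
    # ExLlamaCache keeps an explicit counter instead (attribute name assumed
    # from exllama's generator.py; check your exllama version).
    return cache.current_seq_len
```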

@Glavin001

> I played around a bit with the exllama wrapper in oobabooga/text-generation-webui, and hacked on it until guidance accepted it, and got it to work - but only if I disable caching

@andysalerno: Is this your WIP code that someone could continue working on?

@andysalerno

@Glavin001 it is, but I gave up on it since the lower-level stuff went over my head. I was expecting to just write a little shim class that wraps exllama and presents it as a normal HF transformers class, but the two work so differently internally that it ended up being too difficult for me.

Using the code you linked, I was able to get token generation working with both guidance and exllama - but without caching, which decimates performance. And now that I know a bit more, there are probably other issues I didn't account for, so I wouldn't use it.

So far this exllama wrapper seems to be the closest path: https://github.com/oobabooga/text-generation-webui/blob/c95009d2bd55b5b332148a53a445f3c619a004a5/modules/exllama_hf.py#L27C1-L27C34

But it doesn't provide everything Guidance needs to work.
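
In case anyone wants to pick this up, here's very roughly the shape of the shim I had in mind, modeled loosely on that exllama_hf.py wrapper. The exllama-side calls (constructor, cache reset, forward signature) are assumptions and would need checking against the real APIs, and the past_key_values part is exactly where it falls apart:

```python
# Rough sketch of an HF-style shim around ExLlama. This is not a working
# integration; the exllama-side details are assumptions based on the
# exllama_hf.py wrapper linked above.
from transformers import PretrainedConfig, PreTrainedModel
from transformers.modeling_outputs import CausalLMOutputWithPast


class ExLlamaHF(PreTrainedModel):
    def __init__(self, config: PretrainedConfig, ex_model, ex_cache):
        super().__init__(config)
        self.ex_model = ex_model   # exllama ExLlama instance (assumed)
        self.ex_cache = ex_cache   # exllama ExLlamaCache instance (assumed)

    def forward(self, input_ids=None, past_key_values=None, **kwargs):
        # ExLlama manages its own cache, so feed it only the newly added
        # token when a "past" exists, and the whole prompt otherwise.
        if past_key_values is not None:
            new_tokens = input_ids[:, -1:]
        else:
            self.ex_cache.current_seq_len = 0   # reset cache (attribute assumed)
            new_tokens = input_ids

        # Assumed exllama forward signature: (input_ids, cache) -> logits.
        logits = self.ex_model.forward(new_tokens, self.ex_cache)

        # HF callers (including guidance) expect past_key_values to be a tuple
        # of key/value tensors; handing back the ExLlamaCache here is exactly
        # the part that breaks, since it has none of that tensor structure.
        return CausalLMOutputWithPast(logits=logits, past_key_values=self.ex_cache)
```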
