Upgrade to exllama v2 #1016

Closed

flozi00 opened this issue Sep 12, 2023 · 12 comments

flozi00 (Contributor) commented Sep 12, 2023

### Feature request

https://github.com/turboderp/exllamav2

### Motivation

Overview of differences compared to V1:

- Faster, better kernels
- Cleaner and more versatile codebase
- Support for a new quant format

| Model | Mode | Size | grpsz | act | V1: 3090Ti | V1: 4090 | V2: 3090Ti | V2: 4090 |
|-----------|------|------|-------|-----|------------|----------|------------|----------|
| Llama | GPTQ | 7B | 128 | no | 143 t/s | 173 t/s | 175 t/s | 195 t/s |
| Llama | GPTQ | 13B | 128 | no | 84 t/s | 102 t/s | 105 t/s | 110 t/s |
| Llama | GPTQ | 33B | 128 | yes | 37 t/s | 45 t/s | 45 t/s | 48 t/s |
| OpenLlama | GPTQ | 3B | 128 | yes | 194 t/s | 226 t/s | 295 t/s | 321 t/s |

### Your contribution

I could take a look at the current exllama implementation and what it would take to upgrade, if that's wanted.

flozi00 (Contributor, Author) commented Oct 17, 2023

AutoGPTQ/AutoGPTQ#349

Maybe @SunMarc could give some advice?

flozi00 mentioned this issue Oct 17, 2023
SunMarc (Member) commented Oct 18, 2023

Hi @flozi00, happy to see that you are interested in adding support for the exllamav2 kernel to TGI. I would be happy to review the PR. The integration with transformers and optimum is practically done too. On the kernel side, everything should work based on my tests with autogptq. Make sure to add tests in TGI; you can take inspiration from @fxmarty's integration of exllama in TGI.
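
For context, a minimal sketch of such an integration test. This is not TGI's actual test harness; it assumes a TGI instance serving a GPTQ model (with the exllamav2 kernels) is already running locally on port 8080 and simply exercises the public `/generate` endpoint:

```python
# Sketch of an end-to-end test against a locally running TGI instance.
# Assumes the server was started separately with a GPTQ model on port 8080;
# this does not use TGI's internal test fixtures.
import requests

TGI_URL = "http://127.0.0.1:8080"  # assumed local endpoint


def test_gptq_generate_greedy():
    payload = {
        "inputs": "What is Deep Learning?",
        "parameters": {"max_new_tokens": 20, "do_sample": False},
    }
    resp = requests.post(f"{TGI_URL}/generate", json=payload, timeout=60)
    resp.raise_for_status()
    body = resp.json()
    # Greedy decoding is deterministic, so the text could also be snapshotted
    # and compared against a reference output.
    assert isinstance(body["generated_text"], str)
    assert len(body["generated_text"]) > 0
```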

josephrocca commented
I believe this issue can be closed now.

fxmarty closed this as completed Nov 28, 2023
josephrocca commented Jan 5, 2024

@OlivierDehaene @Narsil I see you were both in the last commit to exllamav2.py. Per Benjamin's comment above, I'm also wondering whether TGI currently has usable exl2 support.

If it's not currently supported, then I think this issue should be reopened.

OlivierDehaene (Member) commented
Exllama v2 is super fast but also super finicky. It is activated by default if you are not sharding your model, as we could not make it work with TP yet.
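
For illustration, a minimal launch sketch under those assumptions: a single shard should let the exllama v2 kernels be used for a GPTQ model, while `--num-shard > 1` falls back to the other GPTQ path. The model id is just an example of a GPTQ checkpoint on the Hub, and a local `text-generation-launcher` install is assumed:

```python
# Minimal sketch: launch TGI for an example GPTQ checkpoint on a single shard,
# the configuration where the exllama v2 kernels are used by default.
# Assumes `text-generation-launcher` is installed locally.
import subprocess

subprocess.run(
    [
        "text-generation-launcher",
        "--model-id", "TheBloke/Llama-2-7B-GPTQ",  # example GPTQ model
        "--quantize", "gptq",
        "--num-shard", "1",   # no tensor parallelism -> exllama kernels apply
        "--port", "8080",
    ],
    check=True,
)
```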

josephrocca commented Jan 9, 2024

@OlivierDehaene Does this mean that we can load exl2 models with TGI? Or is this only for running GPTQ models with the exllama runtime/kernels? (I'm not sure how that works, but IIRC there is no --quantize exl2 option available.)

The former would be great because then we'd get Mixtral on a single 3090: https://twitter.com/turboderp_/status/1741232488301596674 (I've tested this using the "raw" exllama runtime and it works great, but I'm not sure how to do it with TGI)

Narsil (Collaborator) commented Jan 10, 2024

Yes, it's only GPTQ models running on the exllama v2 kernels. The exl2 layout itself is a bit more finicky to add, although probably not impossible.

Technically it's using exactly the same kernels; only the names are different: https://github.com/turboderp/exllamav2/blob/master/exllamav2/module.py#L82 and https://github.com/turboderp/exllamav2/blob/master/exllamav2/ext.py#L184

That said, the fact that exl2 crashes pretty badly with TP>1 is quite concerning, and I don't really want to debug the exl2 kernels for now (I've spent quite some time trying to find why the kernels segfault but couldn't figure out why they were hitting so far outside of where they should; I'm guessing some overly manual pointer logic to hit the "scratch" buffers).

If you want to look into it, that'd be great tbh.
I must say that the TP>1 bug has been the main roadblock for exl2, since we really need to be able to load models with TP for production loads (it yields such latency improvements most of the time that it's hard to pass on). Also, we try not to use quantization too much since it does harm the model quite extensively (especially out of domain, which cannot be captured by benchmarks).

josephrocca commented
@Narsil This pull request of yours looks exciting:

Does it pave the way for loading the exl2 model format in TGI? Or is that not something the team is too interested in right now?


Side note: There are actually more exl2 models (3.2k) on the Hub now than GPTQ (2.6k), though this is somewhat due to a few prolific users doing lots of quants: the number of unique users who have published an exl2 quant is 132, whereas there are 511 for GPTQ¹. Still, GPTQ had a 9-month head start, and it does seem like exl2 is becoming more popular recently for its ability to fit very large models (>100B) onto GPUs. Heavy quantization does seem like it's going to be the "future" to some extent - https://twitter.com/tri_dao/status/1757331306260922515

> tri_dao: I've been curious about quantization scaling laws since Tim_Dettmers showed that 4-bit gets the best quality holding total model bits constant. With clever math in QuIP# + some finetuning, it looks like the sweet spot is shifting to 3-bit or maybe even 2-bit in the future (quote-tweeting: https://twitter.com/tsengalb99/status/1757145731448656060)


¹ Data collection code:

```js
// Fetch model listings from the Hugging Face Hub for both search terms.
let rows = [];
for (let p = 0; p <= 120; p++) { // 120 pages of exl2 results as of writing
  let data = await fetch(`https://huggingface.co/models-json?p=${p}&sort=modified&search=exl2`).then(r => r.json());
  rows.push(...data.models);
  console.log(`Page: ${p}`);
}
for (let p = 0; p <= 80; p++) { // 80 pages of gptq results as of writing
  let data = await fetch(`https://huggingface.co/models-json?p=${p}&sort=modified&search=gptq`).then(r => r.json());
  rows.push(...data.models);
  console.log(`Page: ${p}`);
}
rows.forEach(r => r.lastModifiedEpochMs = new Date(r.lastModified).getTime());
rows.sort((a, b) => a.lastModifiedEpochMs - b.lastModifiedEpochMs);

// Count unique authors per format (a fresh Set per format, so an author who
// has published both kinds of quants is counted in each).
for (let str of ["exl2", "gptq"]) {
  let gotAuthor = new Set();
  let times = rows
    .filter(m => m.id.includes(str))
    .filter(m => gotAuthor.has(m.author) ? false : (gotAuthor.add(m.author), true))
    .map(m => m.lastModifiedEpochMs);
  console.log(str, times.length);
}
```

houmie commented Apr 29, 2024

> If you want to look into it, that'd be great tbh. I must say that the TP>1 bug has been the main roadblock for exl2, since we really need to be able to load models with TP for production loads (it yields such latency improvements most of the time that it's hard to pass on). Also, we try not to use quantization too much since it does harm the model quite extensively (especially out of domain, which cannot be captured by benchmarks).

Hi @Narsil,
Sorry, I found this by googling and your comment was insightful. I'm a big fan of the exl2 format due to its obvious benefits, but I have difficulty finding a production-ready hosting platform that supports it. To name a few, TGI, vLLM, and Ollama don't support it yet. That makes me think maybe I shouldn't use exl2 after all and should fall back to something else that is widely supported and production ready. Which format do you recommend, please?

Thanks

josephrocca commented Apr 29, 2024

Tangential note: at least with Llama 3 70B, under the constraint of fitting within 48 GB of VRAM, the community seems to be leaning toward EXL2. E.g. the EXL2 quants came out on top in WolframRavenwolf's most recent tests:

https://www.reddit.com/r/LocalLLaMA/comments/1cal17l/llm_comparisontest_llama_3_instruct_70b_8b/

The best inference engine for keeping up with quantization (including EXL2) right now seems to be https://github.com/PygmalionAI/aphrodite-engine; it worked well in my tests a couple of months ago, but I haven't put it into production yet.

Second place in Wolfram's tests was AWQ, which TGI does currently have support for.
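
For reference, a minimal client sketch against a TGI instance launched with `--quantize awq` for an AWQ checkpoint. The local endpoint URL is an assumption; any AWQ model supported by TGI would do:

```python
# Sketch: query a TGI server that was started with `--quantize awq`.
# Assumes the server is already running locally on port 8080.
from huggingface_hub import InferenceClient

client = InferenceClient("http://127.0.0.1:8080")
output = client.text_generation(
    "Summarize the trade-offs between AWQ and GPTQ in one sentence.",
    max_new_tokens=64,
)
print(output)
```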

fxmarty (Contributor) commented Apr 30, 2024

AFAIK EXL2 is on the roadmap for TGI.

Narsil (Collaborator) commented May 1, 2024

@houmie exl2 is very nice; it would be my go-to for <4-bit models (and it's the reason we want to add support).

For 4-bit quants I'd say AWQ and GPTQ are both great (the catch is that GPTQ comes in different flavors with different performance profiles; with the right options, GPTQ has better latency and slightly worse throughput than AWQ, but they're pretty much the same overall).
