Integration with Hugging Face transformers library #30

Open · SunMarc opened this issue Jul 11, 2024 · 3 comments
SunMarc commented Jul 11, 2024

Hi neuralmagic team!

Very nice work with AutoFP8! We were thinking of integrating AutoFP8 into transformers, so that users can run your checkpoints directly with transformers. We would simply replace the linear layers with their quantized versions, so we would only support inference. Let us know if you agree with this! The goal would be to expose the quantized linear layer class in this repo (I see that you have several quantized linear variants) and import it in transformers.
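
For illustration, here is a minimal sketch of the kind of swap we have in mind, assuming a checkpoint whose linear weights can be represented in float8_e4m3fn with a per-tensor scale (the class and helper names below are made up for this sketch, not AutoFP8's actual API):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FP8Linear(nn.Module):
    """Hypothetical inference-only linear layer holding FP8 weights plus a scale."""

    def __init__(self, weight_fp8: torch.Tensor, weight_scale: torch.Tensor, bias=None):
        super().__init__()
        # Weight is stored in float8_e4m3fn; the scale maps it back to the original range.
        self.register_buffer("weight_fp8", weight_fp8)
        self.register_buffer("weight_scale", weight_scale)
        self.register_buffer("bias", bias)

    def forward(self, x):
        # Dequantize on the fly; a real kernel would run an FP8 GEMM instead.
        w = self.weight_fp8.to(x.dtype) * self.weight_scale
        return F.linear(x, w, self.bias)


def replace_linear_with_fp8(model: nn.Module) -> nn.Module:
    """Recursively swap every nn.Linear for the FP8 variant (inference only)."""
    for name, module in model.named_children():
        if isinstance(module, nn.Linear):
            w = module.weight.detach()
            scale = w.abs().max() / torch.finfo(torch.float8_e4m3fn).max
            w_fp8 = (w / scale).to(torch.float8_e4m3fn)
            bias = module.bias.detach() if module.bias is not None else None
            setattr(model, name, FP8Linear(w_fp8, scale, bias))
        else:
            replace_linear_with_fp8(module)
    return model
```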

I will be leading the integration, so any help is appreciated! Also, are there any big blockers that I might not have seen?

Thanks in advance!

robertgshaw2-neuralmagic commented Jul 11, 2024

Hey @SunMarc - we are planning to push most of our development into llm-compressor and compressed-tensors, the successors to this mini-repo, which we are already working on integrating into transformers (huggingface/transformers#31704).

This supports:

  • mixed precision w4a16 / w8a16
  • w8a8 int8 (activation quantization; a rough sketch follows this list)
  • w8a8 fp8 (floating point quantization)
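
To make the w8a8 int8 scheme concrete, here is a rough pure-PyTorch emulation (dynamic per-tensor activation scales, per-output-channel weight scales). It is only an illustration of the numerics, not the actual compressed-tensors kernels:

```python
import torch


def quantize_per_tensor(x: torch.Tensor):
    # Dynamic per-tensor activation quantization to int8.
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    return (x / scale).round().clamp(-128, 127).to(torch.int8), scale


def quantize_per_channel(w: torch.Tensor):
    # Per-output-channel weight quantization to int8; w has shape (out, in).
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    return (w / scale).round().clamp(-128, 127).to(torch.int8), scale


def w8a8_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    x_q, x_s = quantize_per_tensor(x)
    w_q, w_s = quantize_per_channel(w)
    # Accumulate in int32 (emulated here), then rescale back to floating point.
    acc = x_q.to(torch.int32) @ w_q.t().to(torch.int32)
    return acc.to(x.dtype) * x_s * w_s.t()
```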

We also support the following algorithms, which can be applied to fp8, int8, and int4 models (a minimal SmoothQuant sketch follows this list):

  • ptq
  • gptq
  • smoothquant
  • sparsegpt
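
As a reference for what SmoothQuant does: the core idea is a per-input-channel rescaling that moves activation outliers into the weights before int8 quantization, leaving the product x @ w.T unchanged. A minimal sketch, assuming a standard nn.Linear weight layout of shape (out_features, in_features):

```python
import torch


def smoothquant_scales(act_amax: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    """Per-input-channel smoothing factors s_j = amax(X_j)^alpha / amax(W_j)^(1 - alpha).

    act_amax: per-channel activation maxima collected on calibration data, shape (in_features,).
    """
    w_amax = weight.abs().amax(dim=0).clamp(min=1e-8)
    return (act_amax.clamp(min=1e-8) ** alpha) / (w_amax ** (1 - alpha))


def apply_smoothing(x: torch.Tensor, weight: torch.Tensor, scales: torch.Tensor):
    # Divide activations and multiply weights by s, so x @ weight.T is unchanged
    # but activation outliers shrink and become easier to quantize to int8.
    return x / scales, weight * scales
```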

We would prefer to put transformers-related efforts behind this framework (including a surge on fp8 and int8 compute with the CUTLASS kernels we use in vLLM).

@robertgshaw2-neuralmagic

Couple other notes for fp8 on various compute capabilities:

  • For pre-Ampere GPUs, we could emulate (since the conversion is quick); see the sketch after this list
  • For Ampere GPUs, we can add support for our fp8 Marlin kernels (mixed precision)
  • For Lovelace/Hopper GPUs we can use torch._scaled_mm to make it easy
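
Rough sketch of the emulation path with per-tensor scales; on Lovelace/Hopper a native FP8 GEMM (e.g. torch._scaled_mm, whose exact signature varies across PyTorch releases) would replace the dequant + matmul below:

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def to_fp8(t: torch.Tensor):
    # Per-tensor scale so the largest value maps to the fp8 dynamic range.
    scale = t.abs().max().clamp(min=1e-8) / FP8_MAX
    return (t / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn), scale


def fp8_matmul_emulated(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Quantize-dequantize emulation: only the cast uses fp8, so it runs on any
    hardware; the matmul itself happens in the original dtype."""
    a_fp8, a_scale = to_fp8(a)
    b_fp8, b_scale = to_fp8(b)
    return (a_fp8.to(a.dtype) * a_scale) @ (b_fp8.to(b.dtype) * b_scale)
```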

@robertgshaw2-neuralmagic

For MoEs:

  • we could also consider adding the Triton MoE kernels we have in vLLM
