EETQ

Easy & Efficient Quantization for Transformers

Table of Contents

  • Features
  • Getting started
  • Usage
  • Examples
  • Performance

Features

  • INT8 weight-only PTQ (a conceptual sketch follows this list)
    • High-performance GEMM kernels adapted from FasterTransformer (original code)
    • No quantization-aware training required
  • Optimized attention layers using Flash-Attention V2
  • Easy to use: adapt your PyTorch model with one line of code
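
For intuition, here is a rough sketch of what INT8 weight-only PTQ means. This is illustrative only, not EETQ's code path: EETQ fuses the dequantization into the high-performance CUDA GEMM kernels, whereas this reference dequantizes in plain PyTorch.

import torch

def quantize_weight_int8(w: torch.Tensor):
    # Per-output-channel symmetric quantization: int8 values plus one float scale per row.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def int8_weight_only_linear(x, q, scale, bias=None):
    # Reference computation: dequantize the weight, then run a normal matmul.
    w = q.to(x.dtype) * scale.to(x.dtype)
    return torch.nn.functional.linear(x, w, bias)

Because only the weights are quantized and activations stay in fp16, no quantization-aware training is needed.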

Getting started

Environment

  • CUDA >= 11.4
  • Python >= 3.8
  • gcc >= 7.4.0
  • torch >= 1.14.0
  • transformers >= 4.27.0

These are minimum versions; newer versions are recommended.

Installation

Using the provided Dockerfile is recommended. To install from source:

$ git clone https://github.com/NetEase-FuXi/EETQ.git
$ cd EETQ/
$ git submodule update --init --recursive
$ pip install .

If your machine has less than 96GB of RAM and many CPU cores, ninja may run too many parallel compilation jobs and exhaust the available RAM. To limit the number of parallel jobs, set the environment variable MAX_JOBS:

$ MAX_JOBS=4 pip install .

Usage

  1. Quantize a torch model:

from eetq.utils import eet_quantize
eet_quantize(torch_model)

  2. Quantize a torch model and optimize it with fused Flash-Attention (a complete end-to-end sketch follows this list):

import torch
from transformers import AutoModelForCausalLM
from eetq.utils import eet_accelerator

# model_name and config are defined as usual for your checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=torch.float16)
eet_accelerator(model, quantize=True, fused_attn=True, dev="cuda:0")
model.to("cuda:0")

# inference
res = model.generate(...)

  3. Use EETQ in TGI (text-generation-inference); see the TGI documentation (a docker example follows this list):

--quantize eetq

  4. Use EETQ in LoRAX; see the LoRAX documentation:

lorax-launcher --model-id mistralai/Mistral-7B-v0.1 --quantize eetq ...
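
Putting step 2 together, a minimal end-to-end example might look like the following. The checkpoint name and prompt are placeholders; any fp16 causal LM from transformers should work the same way.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from eetq.utils import eet_accelerator

model_name = "meta-llama/Llama-2-13b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Quantize the weights to INT8 and fuse the attention layers in place
eet_accelerator(model, quantize=True, fused_attn=True, dev="cuda:0")
model.to("cuda:0")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))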

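For step 3, the flag goes directly on the TGI launcher command line. For example, with the official TGI docker image (image tag and model are illustrative):

$ docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-13b-hf --quantize eetq
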
Examples

Model:

Performance

  • llama-13b (tested on an RTX 3090)
