This repository was archived by the owner on Oct 11, 2024. It is now read-only.
forked from vllm-project/vllm
This merge is a combination of 2 PRs, #186 and #188: #188 is based on #186 and is squash-merged onto #186.

- #186: [1/N] Rs/vllm quantization - Refactor to minimize llama.py changes
- #188: [2/N] Rs/vllm quantization - Refactor refactor to support non-uniform via config

The PR descriptions from both PRs are included here for context. #188's description should be the most relevant, as it is the most recent.

**[2/N] Rs/vllm quantization - Refactor refactor to support non-uniform via config (#188)**

Refactored to support non-uniform quantization by adding a new layer of abstraction. `SmoothQuantLinearMethod` can now hold a `SmoothQuantFormat`, which implements the details of how to do the quant and dequant operations. There are two `SmoothQuantFormat` classes:

- `SmoothQuantDynamicPerToken`
- `SmoothQuantStaticPerTensor`

We have the following lifecycle:

- `LinearMethod` is created during `get_model` and has access to the `QuantizationConfig`.
- A layer is initialized and passed a `LinearMethod`.
- The layer calls `LinearMethod.create_weights`, which creates a dictionary of weights and metadata.
- The layer calls `LinearMethod.apply_weights` during inference, passing the dictionary created during `create_weights`.

This PR modifies the `LinearMethod.create_weights` API to receive a `layer_name` argument. The `LinearMethod` then looks in the config to determine which `SmoothQuantFormat` to use for the layer with that `layer_name`. As a result, the `LinearMethod` is responsible for parsing the config from disk and making decisions about what the inference format should look like. In this specific case, since the `SmoothQuantConfig` is not very good, we just match on the suffix `qkv` to determine what each layer should use, but a `SparseMLConfig` could use a similar structure.

In this PR, the `SmoothQuantFormat` is passed in the dictionary returned by `create_weights` and is then used by `apply_weights`.

In summary, I think this is a good overall structure because it:

- (a) allows us to make minimal changes to the existing models
- (b) allows us to make no changes to the model loading lifecycle (i.e. config / constructor / linear method); this **critically** requires having one `LinearMethod` that propagates through the whole model
- (c) encapsulates the non-uniform logic in the `LinearMethod`, allowing us to have a clean interface into the kernels

**For SparseML Models**

We could imagine the following architecture:

**Config** is responsible for:

- loading the config from disk
- mapping `layer_names` --> `SparseMLFormat`

```python
class SparseMLConfig:
    @classmethod
    def from_dict(cls, config_dict):
        ...

    def get_layer_format(self, layer_name):
        return SparseMLFormat
```

**LinearMethod** is responsible for:

- the interface between layers and kernels (so `LinearMethod` is what is used by the model)

```python
class SparseMLLinearMethod:
    def __init__(self, sparseml_config):
        self.sparseml_config = sparseml_config

    def create_weights(self, layer_name, ...):
        # this, e.g., is where non-uniform might be supported
        format = self.sparseml_config.get_layer_format(layer_name)
        weights = format.get_weights()
        weights["format"] = format
        return weights

    # wrapper around the SparseML format
    def apply_weights(self, x, weights, ...):
        format = weights["format"]
        weights = weights["weights"]
        return format.apply_weights(x, weights)
```

**SparseMLFormat** is responsible for:

- the actual weight creation and forward

```python
class SparseMLFormat:
    def __init__(self, sparseml_config):
        self.sparseml_config = sparseml_config

    def get_weights(self, sizes):
        # returns a dictionary, e.g.
        return {
            "weights": x,
            "scales": y,
        }

    def apply_weights(self, weights, x):
        # calls the cuda kernel
        return output
```

Sample formats:

- `W8A8DynamicPerToken`
- `SparseW8A8StaticPerTensorAsymmetric`
- `W4A8DynamicPerToken`
- ...
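To make the lifecycle and dispatch described above concrete, here is a minimal sketch of how the `layer_name`-based format selection could look. The class and format names come from this PR; the signatures, the stub formats, and the suffix-to-format mapping are illustrative assumptions rather than the code in the diff.

```python
import torch


class SmoothQuantDynamicPerToken:
    """Illustrative stub: dynamic per-token activation quantization."""

    def apply_weights(self, weights, x):
        # A real format would quantize x per token and call an int8 kernel;
        # here the linear op is simply emulated in floating point.
        return x @ weights["weight"].to(x.dtype).t()


class SmoothQuantStaticPerTensor:
    """Illustrative stub: static per-tensor activation quantization."""

    def apply_weights(self, weights, x):
        return x @ weights["weight"].to(x.dtype).t()


class SmoothQuantLinearMethod:
    """Sketch of the layer_name-based dispatch; not the exact code in this PR."""

    def __init__(self, quant_config):
        self.quant_config = quant_config

    def create_weights(self, layer_name, input_size, output_size):
        # The SmoothQuantConfig in this PR is coarse, so the dispatch simply
        # matches the "qkv" suffix; which format maps to which suffix is an
        # assumption here, not taken from the diff.
        if layer_name.endswith("qkv"):
            fmt = SmoothQuantStaticPerTensor()
        else:
            fmt = SmoothQuantDynamicPerToken()
        weights = {"weight": torch.zeros(output_size, input_size, dtype=torch.int8)}
        weights["format"] = fmt  # the format travels with the weights dict
        return weights

    def apply_weights(self, weights, x):
        return weights["format"].apply_weights(weights, x)


# Example: a QKV projection asking the method for its weights (hypothetical layer name).
method = SmoothQuantLinearMethod(quant_config=None)
weights = method.create_weights("model.layers.0.self_attn.qkv", 16, 48)
out = method.apply_weights(weights, torch.randn(2, 16))
```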
**[1/N] Rs/vllm quantization - Refactor to minimize llama.py changes (#186)**

Paired with @dsikka to refactor `SmoothQuantLinearMethod` to avoid making changes to `llama.py`:

- Removed all the "layer specific" `SmoothQuantLinearMethod` classes by making the indexing (splitting QKV into logical shards) generic and explicitly handling the state_dict conversion
- Successfully whittled this down to adding only one LOC to `llama.py`

Many todos left, including:

- We currently have `use_per_token` hardcoded; we need to use the quant config for this
- We need a way to pass different quant configs to each layer to support non-uniform quantization

Since we changed the `LinearMethod` interface to require `layer_name`, we need to update each model.py to plumb this information through the models. We need to do this because we have to pass the `layer_name` to `LinearMethodBase.create_weights` so that we can support non-uniform quantization / compression (we need to be able to consult the quantization config to determine what the weights / format should look like, and we use the layer name to decide this).

So far, we have updated:

- `llama`
- `gemma`
- `phi-2`
- `gpt2`
- `starcoder2`
- `qwen2`
- `deepseek` and `deepseekMoE`
- `baichuan`

To test:

```bash
python3 examples/simple_test.py --help
```

To update:

- Pass `layer_name` to `QKVParallelLinear`, `MergedColumnParallelLinear`, `ColumnParallelLinear`, and `RowParallelLinear` by plumbing `parent_name` through from `Model` --> `DecoderLayer` --> `MLP` / `SelfAttention` --> `Layer` (see the sketch after this list)
- Update `weight_loader` with `linear_method.maybe_update_name`
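The plumbing described in that list looks roughly like the sketch below. The vLLM parallel-linear classes are replaced by a bare placeholder and the constructor signatures are assumptions; the point is only that each level prefixes `parent_name` so the leaf layers can hand a fully qualified `layer_name` to `create_weights`.

```python
import torch.nn as nn


class ColumnParallelLinear(nn.Module):
    """Placeholder for vLLM's parallel linear layers; only the layer_name
    plumbing is shown, not the real constructor signature."""

    def __init__(self, input_size, output_size, linear_method, layer_name):
        super().__init__()
        # With layer_name available, the linear method can consult the quant
        # config for this specific layer when creating weights.
        self.linear_method = linear_method
        self.weights = linear_method.create_weights(layer_name, input_size, output_size)

    def forward(self, x):
        return self.linear_method.apply_weights(self.weights, x)


class LlamaMLP(nn.Module):
    """Stand-in for a model MLP that receives parent_name from its DecoderLayer."""

    def __init__(self, hidden_size, intermediate_size, linear_method, parent_name):
        super().__init__()
        # Prefix parent_name so leaf layers get fully qualified names such as
        # "model.layers.0.mlp.gate_up_proj".
        self.gate_up_proj = ColumnParallelLinear(
            hidden_size, 2 * intermediate_size, linear_method,
            layer_name=f"{parent_name}.gate_up_proj",
        )
        self.down_proj = ColumnParallelLinear(
            intermediate_size, hidden_size, linear_method,
            layer_name=f"{parent_name}.down_proj",
        )

    def forward(self, x):
        gate, up = self.gate_up_proj(x).chunk(2, dim=-1)
        return self.down_proj(nn.functional.silu(gate) * up)
```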
Description: Cutlass integration.

- Use cutlass gemm with epilogue fusion for dequantization
- Remove all existing dequant kernels and interface
- Remove cublas i8gemm files

Test: Run `examples/offline_quantized_inference.py`

```
(vllm-test) varun@floppy-fan:~/code/neuralmagic-vllm (vllm-quantization-cutlass) $ python3 ./examples/offline_quantized_inference.py
...
Prompt: 'Hello, my name is', Generated text: ' John and I am a recovering workaholic.\nI used to work all the time'
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government of the United States of America. The president leads the executive'
Prompt: 'The capital of France is', Generated text: ' Paris.\nThe capital of France is Paris.'
Prompt: 'The future of AI is', Generated text: ' here, and it’s more accessible than ever.\nThe future of AI is here,'
```

Profiling results: Prefill 512 tokens, Branch: This branch, dtype: "torch.float", model: Quantized model - [results](https://drive.google.com/file/d/1GydrBmphPTrBMujIPL9K_Y-ZauQ_8FlR/view?usp=sharing)

Note that this branch is better than the [previous best](https://drive.google.com/file/d/1Ga_rpnRCYUtenBUj_BDPcZVvbIRcvRB8/view?usp=drive_link) (w8a8 upstream PR with custom fused kernels).

---------

Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
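For reference, "cutlass gemm with epilogue fusion for dequantization" means the int8 GEMM applies its dequantization scales inside the epilogue instead of writing an int32 result and launching a separate dequant kernel. The snippet below is only a numerical reference for what the fused kernel is expected to produce; it is not the CUTLASS code added in this PR, and the scale shapes are an assumption.

```python
import torch


def reference_w8a8_gemm_dequant(a_i8, b_i8, a_scale, b_scale):
    """Numerical reference for an int8 GEMM whose epilogue applies dequantization.

    a_i8:    [M, K] int8 activations
    b_i8:    [K, N] int8 weights
    a_scale: activation scale(s), scalar or broadcastable to [M, 1]
    b_scale: weight scale(s), scalar or broadcastable to [1, N]
    """
    # The fused kernel accumulates in int32; float32 is exact here for small K.
    acc = a_i8.to(torch.float32) @ b_i8.to(torch.float32)
    # Epilogue: scale the accumulator back to a floating-point output instead of
    # writing int32 out and running a separate dequant kernel.
    return acc * (a_scale * b_scale)


# Tiny usage example with made-up scales.
a = torch.randint(-128, 128, (4, 64), dtype=torch.int8)
b = torch.randint(-128, 128, (64, 8), dtype=torch.int8)
out = reference_w8a8_gemm_dequant(a, b, a_scale=0.02, b_scale=0.01)
```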
DO NOT MERGE
Quantization WIP.
Based on vLLM PR 1508.
Quantized model used for dev/testing: https://huggingface.co/nm-testing/Nous-Hermes-Llama2-13b-smoothquant
Base model: https://huggingface.co/NousResearch/Nous-Hermes-Llama2-13b
Testing:

Command:

```bash
python3 ./examples/offline_quantized_inference.py
```

Expected output with `tensor_parallel_size=1` in `./examples/offline_quantized_inference.py`.

Expected output with `tensor_parallel_size=2` in `./examples/offline_quantized_inference.py`.
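As a rough sketch of what such an offline quantized inference run looks like, the script below mirrors vLLM's standard offline-inference example with the dev/testing checkpoint from this PR. The prompts are the ones in the pasted output earlier in this conversation; the `quantization` argument value and the exact contents of `examples/offline_quantized_inference.py` are assumptions, not copied from the file.

```python
from vllm import LLM, SamplingParams

# Hypothetical reconstruction of examples/offline_quantized_inference.py; the
# quantization flag value and script contents are assumptions, not the PR's file.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="nm-testing/Nous-Hermes-Llama2-13b-smoothquant",  # dev/testing checkpoint above
    quantization="smoothquant",   # assumed flag value for this fork
    tensor_parallel_size=1,       # set to 2 for the second expected-output case
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
```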
Profile command for this branch:

```bash
./experiments.sh -t quant -d w8a8 -m torch.float -o ./profile
```
Profiling results:

- Prefill 512 tokens, Branch: vllm-main, dtype: "auto" (fp16), model: Base model - results
- Prefill 512 tokens, Branch: N/A (fused-kernels), dtype: "torch.float", model: Quantized model - results
- Prefill 512 tokens, Branch: This branch (unfused-kernels), dtype: "torch.float", model: Quantized model - results