This repository was archived by the owner on Oct 11, 2024. It is now read-only.
forked from vllm-project/vllm
This merge is a combination of 2 PRs, #186 and #188: #188 is based on #186 and is squash-merged onto #186.

- #186: [1/N] Rs/vllm quantization - Refactor to minimize llama.py changes
- #188: [2/N] Rs/vllm quantization - Refactor refactor to support non-uniform via config

The PR descriptions from both PRs are included here for context. #188's description should be the most relevant, as it is the most recent.

**[2/N] Rs/vllm quantization - Refactor refactor to support non-uniform via config (#188)**

Refactored to support non-uniform quantization by adding a new layer of abstraction. `SmoothQuantLinearMethod` can now hold a `SmoothQuantFormat`, which implements the details of how to do the quant and dequant operations. There are two `SmoothQuantFormat` classes:

- `SmoothQuantDynamicPerToken`
- `SmoothQuantStaticPerTensor`

We have the following lifecycle:

- `LinearMethod` is created during `get_model` and has access to the `QuantizationConfig`.
- A layer is initialized and passed a `LinearMethod`.
- The layer calls `LinearMethod.create_weights`, which creates a dictionary of weights and metadata.
- The layer calls `LinearMethod.apply_weights` during inference, passing the dictionary created during `create_weights`.

This PR modifies the `LinearMethod.create_weights` API to receive a `layer_name` argument. The `LinearMethod` then looks in the config to determine which `SmoothQuantFormat` to use for the layer with that `layer_name`. As a result, the `LinearMethod` is responsible for parsing the config from disk and making decisions about what the inference format should look like. In this specific case, since the `SmoothQuantConfig` is not very good, we just match on the suffix `qkv` to determine what each layer should use, but a `SparseMLConfig` could use a similar structure.

In this PR, the `SmoothQuantFormat` is passed in the dictionary returned by `create_weights` and is then used by `apply_weights`.

In summary, I think this is a good overall structure because it:

- (a) allows us to make minimal changes to the existing models
- (b) allows us to make no changes to the model loading lifecycle (i.e. config / constructor / linear method); this **critically** requires having one `LinearMethod` that propagates through the whole model
- (c) encapsulates the non-uniform logic in the `LinearMethod`, allowing us to have a clean interface into the kernels

**For SparseML Models**

We could imagine the following architecture:

**Config** is responsible for:

- loading the config from disk
- mapping `layer_names` --> `SparseMLFormat`

```python
class SparseMLConfig:
    @classmethod
    def from_dict(cls, config_dict):
        ...

    def get_layer_format(self, layer_name):
        return SparseMLFormat
```

**LinearMethod** is responsible for:

- the interface between layers and kernels (so `LinearMethod` is what is used by the model)

```python
class SparseMLLinearMethod:
    def __init__(self, sparseml_config):
        self.sparseml_config = sparseml_config

    def create_weights(self, layer_name, ...):
        # this, e.g., is where non-uniform might be supported
        format = self.sparseml_config.get_layer_format(layer_name)
        weights = format.get_weights()
        weights["format"] = format
        return weights

    # wrapper around the SparseML format
    def apply_weights(self, x, weights, ...):
        format = weights["format"]
        weights = weights["weights"]
        return format.apply_weights(x, weights)
```

**SparseMLFormat** is responsible for:

- the actual weight creation and forward

```python
class SparseMLFormat:
    def __init__(self, sparseml_config):
        self.sparseml_config = sparseml_config

    def get_weights(self, sizes):
        # returns a dictionary, e.g.
        return {
            "weights": x,
            "scales": y,
        }

    def apply_weights(self, weights, x):
        # calls the cuda kernel
        return output
```

Sample formats:

- `W8A8DynamicPerToken`
- `SparseW8A8StaticPerTensorAsymmetric`
- `W4A8DynamicPerToken`
- ...
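To make the lifecycle and dispatch described above concrete, here is a minimal sketch of how the `layer_name`-based format selection could look. The class and format names come from this PR; the signatures, the stub formats, and the suffix-to-format mapping are illustrative assumptions rather than the code in the diff.

```python
import torch


class SmoothQuantDynamicPerToken:
    """Illustrative stub: dynamic per-token activation quantization."""

    def apply_weights(self, weights, x):
        # A real format would quantize x per token and call an int8 kernel;
        # here the linear op is simply emulated in floating point.
        return x @ weights["weight"].to(x.dtype).t()


class SmoothQuantStaticPerTensor:
    """Illustrative stub: static per-tensor activation quantization."""

    def apply_weights(self, weights, x):
        return x @ weights["weight"].to(x.dtype).t()


class SmoothQuantLinearMethod:
    """Sketch of the layer_name-based dispatch; not the exact code in this PR."""

    def __init__(self, quant_config):
        self.quant_config = quant_config

    def create_weights(self, layer_name, input_size, output_size):
        # The SmoothQuantConfig in this PR is coarse, so the dispatch simply
        # matches the "qkv" suffix; which format maps to which suffix is an
        # assumption here, not taken from the diff.
        if layer_name.endswith("qkv"):
            fmt = SmoothQuantStaticPerTensor()
        else:
            fmt = SmoothQuantDynamicPerToken()
        weights = {"weight": torch.zeros(output_size, input_size, dtype=torch.int8)}
        weights["format"] = fmt  # the format travels with the weights dict
        return weights

    def apply_weights(self, weights, x):
        return weights["format"].apply_weights(weights, x)


# Example: a QKV projection asking the method for its weights (hypothetical layer name).
method = SmoothQuantLinearMethod(quant_config=None)
weights = method.create_weights("model.layers.0.self_attn.qkv", 16, 48)
out = method.apply_weights(weights, torch.randn(2, 16))
```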
**[1/N] Rs/vllm quantization - Refactor to minimize llama.py changes (#186)**

Paired with @dsikka to refactor `SmoothQuantLinearMethod` to avoid making changes to `llama.py`:

- Removed all the "layer specific" `SmoothQuantLinearMethod` classes by making the indexing (splitting QKV into logical shards) generic and explicitly handling the state_dict conversion
- Successfully whittled this down to adding only one LOC to `llama.py`

Many todos left, including:

- We currently have `use_per_token` hardcoded; we need to use the quant config for this
- We need a way to pass different quant configs to each layer to support non-uniform quantization

Since we changed the `LinearMethod` interface to require `layer_name`, we need to update each model.py to plumb this information through the models. We need to do this because we have to pass the `layer_name` to `LinearMethodBase.create_weights` so that we can support non-uniform quantization / compression (we need to be able to consult the quantization config to determine what the weights / format should look like, and we use the layer name to decide this).

So far, we have updated:

- `llama`
- `gemma`
- `phi-2`
- `gpt2`
- `starcoder2`
- `qwen2`
- `deepseek` and `deepseekMoE`
- `baichuan`

To test:

```bash
python3 examples/simple_test.py --help
```

To update:

- Pass `layer_name` to `QKVParallelLinear`, `MergedColumnParallelLinear`, `ColumnParallelLinear`, and `RowParallelLinear` by plumbing `parent_name` through from `Model` --> `DecoderLayer` --> `MLP` / `SelfAttention` --> `Layer` (see the sketch after this list)
- Update `weight_loader` with `linear_method.maybe_update_name`
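The plumbing described in that list looks roughly like the sketch below. The vLLM parallel-linear classes are replaced by a bare placeholder and the constructor signatures are assumptions; the point is only that each level prefixes `parent_name` so the leaf layers can hand a fully qualified `layer_name` to `create_weights`.

```python
import torch.nn as nn


class ColumnParallelLinear(nn.Module):
    """Placeholder for vLLM's parallel linear layers; only the layer_name
    plumbing is shown, not the real constructor signature."""

    def __init__(self, input_size, output_size, linear_method, layer_name):
        super().__init__()
        # With layer_name available, the linear method can consult the quant
        # config for this specific layer when creating weights.
        self.linear_method = linear_method
        self.weights = linear_method.create_weights(layer_name, input_size, output_size)

    def forward(self, x):
        return self.linear_method.apply_weights(self.weights, x)


class LlamaMLP(nn.Module):
    """Stand-in for a model MLP that receives parent_name from its DecoderLayer."""

    def __init__(self, hidden_size, intermediate_size, linear_method, parent_name):
        super().__init__()
        # Prefix parent_name so leaf layers get fully qualified names such as
        # "model.layers.0.mlp.gate_up_proj".
        self.gate_up_proj = ColumnParallelLinear(
            hidden_size, 2 * intermediate_size, linear_method,
            layer_name=f"{parent_name}.gate_up_proj",
        )
        self.down_proj = ColumnParallelLinear(
            intermediate_size, hidden_size, linear_method,
            layer_name=f"{parent_name}.down_proj",
        )

    def forward(self, x):
        gate, up = self.gate_up_proj(x).chunk(2, dim=-1)
        return self.down_proj(nn.functional.silu(gate) * up)
```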
Description: Cutlass integration.

- Use cutlass gemm with epilogue fusion for dequantization
- Remove all existing dequant kernels and interface
- Remove cublas i8gemm files

Test: Run `examples/offline_quantized_inference.py`

```
(vllm-test) varun@floppy-fan:~/code/neuralmagic-vllm (vllm-quantization-cutlass) $ python3 ./examples/offline_quantized_inference.py
...
Prompt: 'Hello, my name is', Generated text: ' John and I am a recovering workaholic.\nI used to work all the time'
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government of the United States of America. The president leads the executive'
Prompt: 'The capital of France is', Generated text: ' Paris.\nThe capital of France is Paris.'
Prompt: 'The future of AI is', Generated text: ' here, and it’s more accessible than ever.\nThe future of AI is here,'
```

Profiling results: Prefill 512 tokens, Branch: This branch, dtype: "torch.float", model: Quantized model - [results](https://drive.google.com/file/d/1GydrBmphPTrBMujIPL9K_Y-ZauQ_8FlR/view?usp=sharing)

Note that this branch is better than the [previous best](https://drive.google.com/file/d/1Ga_rpnRCYUtenBUj_BDPcZVvbIRcvRB8/view?usp=drive_link) (w8a8 upstream PR with custom fused kernels).

---------

Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
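For reference, "cutlass gemm with epilogue fusion for dequantization" means the int8 GEMM applies its dequantization scales inside the epilogue instead of writing an int32 result and launching a separate dequant kernel. The snippet below is only a numerical reference for what the fused kernel is expected to produce; it is not the CUTLASS code added in this PR, and the scale shapes are an assumption.

```python
import torch


def reference_w8a8_gemm_dequant(a_i8, b_i8, a_scale, b_scale):
    """Numerical reference for an int8 GEMM whose epilogue applies dequantization.

    a_i8:    [M, K] int8 activations
    b_i8:    [K, N] int8 weights
    a_scale: activation scale(s), scalar or broadcastable to [M, 1]
    b_scale: weight scale(s), scalar or broadcastable to [1, N]
    """
    # The fused kernel accumulates in int32; float32 is exact here for small K.
    acc = a_i8.to(torch.float32) @ b_i8.to(torch.float32)
    # Epilogue: scale the accumulator back to a floating-point output instead of
    # writing int32 out and running a separate dequant kernel.
    return acc * (a_scale * b_scale)


# Tiny usage example with made-up scales.
a = torch.randint(-128, 128, (4, 64), dtype=torch.int8)
b = torch.randint(-128, 128, (64, 8), dtype=torch.int8)
out = reference_w8a8_gemm_dequant(a, b, a_scale=0.02, b_scale=0.01)
```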
DO NOT MERGE
Quantization WIP.
Based on vLLM PR 1508.
Quantized model used for dev/testing: https://huggingface.co/nm-testing/Nous-Hermes-Llama2-13b-smoothquant
Base model: https://huggingface.co/NousResearch/Nous-Hermes-Llama2-13b
Testing:

Command:

```bash
python3 ./examples/offline_quantized_inference.py
```

Expected output with `tensor_parallel_size=1` in `./examples/offline_quantized_inference.py`.

Expected output with `tensor_parallel_size=2` in `./examples/offline_quantized_inference.py`.
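As a rough sketch of what such an offline quantized inference run looks like, the script below mirrors vLLM's standard offline-inference example with the dev/testing checkpoint from this PR. The prompts are the ones in the pasted output earlier in this conversation; the `quantization` argument value and the exact contents of `examples/offline_quantized_inference.py` are assumptions, not copied from the file.

```python
from vllm import LLM, SamplingParams

# Hypothetical reconstruction of examples/offline_quantized_inference.py; the
# quantization flag value and script contents are assumptions, not the PR's file.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="nm-testing/Nous-Hermes-Llama2-13b-smoothquant",  # dev/testing checkpoint above
    quantization="smoothquant",   # assumed flag value for this fork
    tensor_parallel_size=1,       # set to 2 for the second expected-output case
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
```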
Profile command for this branch:

```bash
./experiments.sh -t quant -d w8a8 -m torch.float -o ./profile
```
Profiling results:

- Prefill 512 tokens, Branch: vllm-main, dtype: "auto" (fp16), model: Base model - results
- Prefill 512 tokens, Branch: N/A (fused-kernels), dtype: "torch.float", model: Quantized model - results
- Prefill 512 tokens, Branch: This branch (unfused-kernels), dtype: "torch.float", model: Quantized model - results