
vllm - quantization : DO NOT MERGE #180

Closed
wants to merge 8 commits into from

Conversation


@varun-sundar-rabindranath commented Apr 11, 2024

DO NOT MERGE

Quantization WIP.

Based off vLLM PR 1508

Quantized model used for dev/testing : https://huggingface.co/nm-testing/Nous-Hermes-Llama2-13b-smoothquant
Base model : https://huggingface.co/NousResearch/Nous-Hermes-Llama2-13b

Testing:
Command: python3 ./examples/offline_quantized_inference.py
Expected output with tensor_parallel_size=1 in ./examples/offline_quantized_inference.py:

Prompt: 'Hello, my name is', Generated text: ' John and I am a recovering workaholic.\nI used to work all the time'
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government of the United States of America. The president leads the executive'
Prompt: 'The capital of France is', Generated text: ' Paris.\nThe capital of France is Paris.'
Prompt: 'The future of AI is', Generated text: ' here, and it’s more accessible than ever.\nThe future of AI is here,'

Expected output with tensor_parallel_size=2 in ./examples/offline_quantized_inference.py:

Prompt: 'Hello, my name is', Generated text: ' Dr. John and I am a chiropractor. I have been in practice for over 2'
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government of the United States of America. The president leads the executive'
Prompt: 'The capital of France is', Generated text: ' Paris.\nThe capital of France is Paris.'
Prompt: 'The future of AI is', Generated text: ' here, and it’s only getting better. With advancements in machine learning and natural language'

Profile command for this branch:

./experiments.sh -t quant -d w8a8 -m torch.float -o ./profile

Profiling results:

Prefill 512 tokens, Branch: vllm-main, dtype: "auto" (fp16), model: Base model - results

Prefill 512 tokens, Branch: N/A (fused-kernels), dtype: "torch.float", model: Quantized model - results

Prefill 512 tokens, Branch: This branch (unfused-kernels), dtype: "torch.float", model: Quantized model - results

Varun Sundar Rabindranath and others added 8 commits April 11, 2024 03:48
This merge is a combination of 2 PRs, #186 and #188: #188 is based on #186 and is squash-merged onto #186.

#186: [1/N] Rs/vllm quantization - Refactor to minimize llama.py changes
#188: [2/N] Rs/vllm quantization - Refactor to support non-uniform via config

The PR descriptions from both PRs are included here for context. #188's description should be the most relevant, as it is the most recent.

[2/N] Rs/vllm quantization - Refactor to support non-uniform via config

Refactored to support non-uniform quantization by adding a new layer of abstraction.

Now, SmoothQuantLinearMethod can hold a SmoothQuantFormat, which implements the details of the quant and dequant operations. There are two SmoothQuantFormat classes:

- SmoothQuantDynamicPerToken
- SmoothQuantStaticPerTensor
We have the following lifecycle (a minimal interface sketch follows the list):

1. LinearMethod is created during get_model and has access to the QuantizationConfig
2. Layer is initialized and passed a LinearMethod
3. Layer calls LinearMethod.create_weights, which creates a dictionary of weights and metadata
4. Layer calls LinearMethod.apply_weights during inference, passing the dictionary created during create_weights
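A minimal sketch of this lifecycle, with simplified signatures (the actual vLLM interface takes more arguments; this is illustrative, not the PR's code):

```python
from abc import ABC, abstractmethod
from typing import Any, Dict

import torch


class LinearMethodBase(ABC):
    """Sketch of the lifecycle above; signatures are simplified."""

    @abstractmethod
    def create_weights(self, input_size: int, output_size: int,
                       params_dtype: torch.dtype) -> Dict[str, Any]:
        """Called once per layer at load time; returns weights plus metadata."""

    @abstractmethod
    def apply_weights(self, weights: Dict[str, Any],
                      x: torch.Tensor) -> torch.Tensor:
        """Called on every forward pass with the dict from create_weights."""
```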
This PR modifies the LinearMethod.create_weights API to receive a layer_name argument. The LinearMethod then looks in the config to determine which SmoothQuantFormat to use for the layer with that layer_name.

As a result, the LinearMethod is responsible for parsing the config from disk and making decisions about what the inference format should look like. In this specific case, since the SmoothQuantConfig is not very expressive, we just match on the suffix qkv to determine what each layer should use, but for a SparseMLConfig we could use a similar structure.

In this PR, the SmoothQuantFormat is passed in the dictionary returned by create_weights and is then used by apply_weights.
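A hedged sketch of that dispatch, building on the LinearMethodBase sketch above. Which format maps to which suffix is an illustrative placeholder, not taken from the PR:

```python
class SmoothQuantDynamicPerToken:
    """Stub; the PR implements dynamic per-token quant/dequant here."""


class SmoothQuantStaticPerTensor:
    """Stub; the PR implements static per-tensor quant/dequant here."""


class SmoothQuantLinearMethod(LinearMethodBase):
    def __init__(self, quant_config):
        self.quant_config = quant_config

    def _format_for(self, layer_name: str):
        # The SmoothQuantConfig carries no per-layer metadata, so the PR
        # matches on the "qkv" suffix; this mapping is a placeholder.
        if layer_name.endswith("qkv"):
            return SmoothQuantStaticPerTensor()
        return SmoothQuantDynamicPerToken()

    # The PR adds layer_name to the create_weights signature.
    def create_weights(self, layer_name, input_size, output_size, params_dtype):
        fmt = self._format_for(layer_name)
        # (real code would also create the quantized weight tensors here)
        weights = {"format": fmt}  # travels with the weights to apply_weights
        return weights

    def apply_weights(self, weights, x):
        return weights["format"].apply_weights(weights, x)
```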

In Summary
I think this is a good overall structure because it:

(a) allows us to make minimal changes to the existing models
(b) allows us to make no changes to the model loading lifecycle (i.e. config / constructor / linear method); this critically requires having one LinearMethod that propagates through the whole model
(c) encapsulates the non-uniform logic in the LinearMethod, giving us a clean interface into the underlying formats
For SparseML Models
We could imagine the following architecture:

Config
Config is responsible for:

- loading the config from disk
- mapping layer_names --> SparseMLFormat
```python
class SparseMLConfig:
    @classmethod
    def from_dict(cls, config: dict) -> "SparseMLConfig":
        ...

    def get_layer_format(self, layer_name: str) -> "SparseMLFormat":
        ...
```
LinearMethod
The LinearMethod is responsible for:

- the interface between layers and kernels (so the LinearMethod is what the model uses)
```python
class SparseMLLinearMethod:
    def __init__(self, sparseml_config):
        self.sparseml_config = sparseml_config

    def create_weights(self, layer_name, sizes):
        # this, e.g., is where non-uniform quantization is supported
        fmt = self.sparseml_config.get_layer_format(layer_name)

        weights = fmt.get_weights(sizes)
        weights["format"] = fmt

        return weights

    # wrapper around the SparseML format
    def apply_weights(self, x, weights):
        fmt = weights["format"]
        return fmt.apply_weights(weights, x)
```
SparseMLFormat
The Format is responsible for:

- the actual weight creation and forward pass
```python
class SparseMLFormat:
    def __init__(self, sparseml_config):
        self.sparseml_config = sparseml_config

    def get_weights(self, sizes):
        # returns a dictionary, e.g.
        return {
            "weights": ...,
            "scales": ...,
        }

    def apply_weights(self, weights, x):
        # calls the CUDA kernel
        ...
```
Sample Formats:
- W8A8DynamicPerToken
- SparseW8A8StaticPerTensorAsymmetric
- W4A8DynamicPerToken
- ...
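To make one of these concrete, here is a hedged PyTorch reference for the W8A8DynamicPerToken math (illustrative only; a real format would dispatch to a fused CUDA kernel, and the signatures are assumptions):

```python
import torch


class W8A8DynamicPerToken:
    """Reference math only; a real format calls a fused int8 kernel."""

    def get_weights(self, input_size: int, output_size: int) -> dict:
        return {
            "weights": torch.zeros(output_size, input_size, dtype=torch.int8),
            "scales": torch.ones(output_size, dtype=torch.float32),
        }

    def apply_weights(self, weights: dict, x: torch.Tensor) -> torch.Tensor:
        # Dynamic per-token activation quantization.
        x_scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
        x_q = torch.clamp((x / x_scale).round(), -127, 127).to(torch.int8)
        # int8 GEMM accumulating in int32 (emulated here; fused on GPU).
        acc = x_q.to(torch.int32) @ weights["weights"].t().to(torch.int32)
        # Dequantize with per-token and per-output-channel scales.
        return acc.to(torch.float32) * x_scale * weights["scales"]
```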

[1/N] Rs/vllm quantization - Refactor to minimize llama.py changes #186 

Paired with @dsikka to refactor `SmoothQuantLinearMethod` to avoid
making changes to `llama.py`:
- Removed all the "layer specific" `SmoothQuantLinearMethod` logic by making
the indexing (splitting QKV into logical shards) generic and explicitly
handling the state_dict conversion (a sketch of this indexing follows the list)
- Successfully whittled this down to adding only one LOC to `llama.py`
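A hypothetical sketch of that generic shard indexing during weight loading; the function name, shard layout, and signature are assumptions, not the PR's actual code:

```python
import torch


def load_merged_qkv_shard(param: torch.Tensor, loaded: torch.Tensor,
                          shard_id: str, q_size: int, kv_size: int) -> None:
    """Copy a q/k/v checkpoint tensor into its slice of the merged QKV param."""
    offsets = {"q": 0, "k": q_size, "v": q_size + kv_size}
    sizes = {"q": q_size, "k": kv_size, "v": kv_size}
    param.narrow(0, offsets[shard_id], sizes[shard_id]).copy_(loaded)
```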

Many TODOs left, including:
- We currently have `use_per_token` hardcoded; we need to use the quant
config for this
- We need a way to pass different quant configs to each layer to support
non-uniform quantization

Since we changed the `LinearMethod` interface to require `layer_name`,
we need to update each model.py to plumb this information through the
models. We need to do this because we need to pass the `layer_name` to
`LinearMethodBase.create_weights` so that we can have non-uniform
quantization / compression (we need to be able to consult the
quantization config to determine what the weights / format should look
like, and we use the layer name to decide this).

So far, we have updated:
- `llama`
- `gemma`
- `phi-2`
- `gpt2`
- `starcoder2`
- `qwen2`
- `deepseek` and `deepseekMoE`
- `baichuan`

To test:
```bash
python3 examples/simple_test.py --help
```

To Update:
- Pass `layer_name` to `QKVParallelLinear`,
`MergedColumnParallelLinear`, `ColumnParallelLinear`, and
`RowParallelLinear` by plumbing `parent_name` through from `Model` -->
`DecoderLayer` --> `MLP` / `SelfAttention` --> `Layer` (a sketch of this
plumbing follows the list)
- Update `weight_loader` with `linear_method.maybe_update_name`
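A minimal sketch of that plumbing, with hypothetical class names and simplified constructors (the real vLLM layers take many more arguments):

```python
import torch.nn as nn


class SketchLinear(nn.Module):
    """Stands in for QKVParallelLinear and friends."""

    def __init__(self, linear_method, layer_name: str, in_f: int, out_f: int):
        super().__init__()
        # layer_name reaches create_weights, so the quant config can choose
        # a per-layer format (the point of the refactor).
        self.weights = linear_method.create_weights(layer_name, (in_f, out_f))


class SketchMLP(nn.Module):
    def __init__(self, linear_method, parent_name: str):
        super().__init__()
        self.up_proj = SketchLinear(
            linear_method, f"{parent_name}.up_proj", 4096, 11008)


class SketchDecoderLayer(nn.Module):
    def __init__(self, linear_method, parent_name: str):
        super().__init__()
        self.mlp = SketchMLP(linear_method, f"{parent_name}.mlp")


class SketchModel(nn.Module):
    def __init__(self, linear_method, num_layers: int = 2):
        super().__init__()
        self.layers = nn.ModuleList(
            SketchDecoderLayer(linear_method, f"model.layers.{i}")
            for i in range(num_layers))
```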
Description:
CUTLASS integration.
- Use a CUTLASS GEMM with epilogue fusion for the dequantization (reference math for the epilogue is sketched after this list)
- Remove all existing dequant kernels and their interface
- Remove the cuBLAS i8gemm files
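As a rough reference for what the fused epilogue computes, assuming per-token activation scales and per-channel weight scales (plain PyTorch; the PR does this inside the CUTLASS epilogue, not as separate ops):

```python
import torch

M, K, N = 4, 64, 32
a = torch.randint(-128, 128, (M, K), dtype=torch.int8)  # quantized activations
b = torch.randint(-128, 128, (K, N), dtype=torch.int8)  # quantized weights
a_scale = torch.rand(M, 1)  # per-token activation scales
b_scale = torch.rand(1, N)  # per-channel weight scales

# The int8 GEMM accumulates in int32; the epilogue then rescales each
# element, replacing a separate dequant kernel:
#   D[m, n] = acc[m, n] * a_scale[m] * b_scale[n]
acc = a.to(torch.int32) @ b.to(torch.int32)
d = acc.to(torch.float32) * a_scale * b_scale
```

Fusing the rescale into the epilogue avoids round-tripping the int32 accumulator through global memory and a separate kernel launch, which is presumably where the win over the unfused kernels comes from.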

Test:
Run `examples/offline_quantized_inference.py`
```
(vllm-test) varun@floppy-fan:~/code/neuralmagic-vllm (vllm-quantization-cutlass) $ python3 ./examples/offline_quantized_inference.py 
...
Prompt: 'Hello, my name is', Generated text: ' John and I am a recovering workaholic.\nI used to work all the time'
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government of the United States of America. The president leads the executive'
Prompt: 'The capital of France is', Generated text: ' Paris.\nThe capital of France is Paris.'
Prompt: 'The future of AI is', Generated text: ' here, and it’s more accessible than ever.\nThe future of AI is here,'
```
Profiling results:
Prefill 512 tokens, Branch: this branch, dtype: "torch.float", model:
Quantized model -
[results](https://drive.google.com/file/d/1GydrBmphPTrBMujIPL9K_Y-ZauQ_8FlR/view?usp=sharing)

Note that this branch is better than the [previous
best](https://drive.google.com/file/d/1Ga_rpnRCYUtenBUj_BDPcZVvbIRcvRB8/view?usp=drive_link)
(the w8a8 upstream PR with custom fused kernels).

---------

Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>