[Kernel] Initial Activation Quantization Support #4525

Merged: 49 commits, May 23, 2024
Changes from 1 commit
4d27a2c
Initial `CompressedTensors` config + Activation Quantization support …
dsikka Apr 30, 2024
92b3703
add get_quant method to compressed tensors config
dsikka Apr 30, 2024
2a3eb83
small rebase fixed
dsikka Apr 30, 2024
3dd1fe8
format
dsikka Apr 30, 2024
f2f8c52
fix mypy complaints
Apr 30, 2024
c9308eb
Merge branch 'main' into ds-quant
dsikka Apr 30, 2024
d9d49b5
format fixes
dsikka Apr 30, 2024
b111ee6
Merge branch 'main' into ds-quant
dsikka May 1, 2024
c31a7af
format fix post rebase
dsikka May 1, 2024
ca01b39
lazy import CompressedTensorsW8A8StaticTensor (#220)
varun-sundar-rabindranath May 1, 2024
f0197d4
lazy cutlass_gemm_dq import (#221)
varun-sundar-rabindranath May 1, 2024
4624b46
fix asm
May 1, 2024
75757d5
update shape change
dsikka May 2, 2024
e1df0eb
add todo
dsikka May 2, 2024
bc0991c
Rename quant_per_tensor -> static_scaled_int8_quant
May 2, 2024
74ad650
Remove cruft
May 2, 2024
43c43f3
Merge branch 'main' into ds-quant
dsikka May 14, 2024
cf5600f
fixes : typo
May 14, 2024
169ce7f
py-cutlass temporary hack for num_prompts==1
May 15, 2024
03b53e7
yapf
May 15, 2024
f9df31b
add test_int8_quant
May 16, 2024
ba4b6b3
call cpp cutlass
May 17, 2024
3c223c6
Merge branch 'main' into ds-quant
dsikka May 17, 2024
b27f31a
remove cutlass py interface
May 17, 2024
b589cdd
format.sh
May 17, 2024
98159cf
remove fake-quant
May 17, 2024
8dbeb31
add compressed tensors test
dsikka May 17, 2024
5eeb40a
remove torch.int8
dsikka May 17, 2024
c55e023
format
dsikka May 17, 2024
f5cbbd3
fix config parsing to match new model
dsikka May 20, 2024
a685957
revert parsing to use default pathway
dsikka May 20, 2024
4dfb37f
PR comments
dsikka May 21, 2024
de81f9e
Fix scales/zero-points device allocation
May 21, 2024
15f1863
ruff
May 21, 2024
bd53847
add better comments
May 21, 2024
b2926f3
add comment
dsikka May 22, 2024
1274386
Merge branch 'main' into ds-quant
dsikka May 22, 2024
18640c8
clang format
dsikka May 22, 2024
5c5dc84
clang format again
dsikka May 22, 2024
a44b4a0
address PR comments
May 22, 2024
6f0e6e1
clang-format
May 22, 2024
0090454
remove layer name
dsikka May 23, 2024
4b10fd7
remove unused import
dsikka May 23, 2024
68a59c7
remove parent name
dsikka May 23, 2024
b0afe67
Fix rounding
May 22, 2024
4f4951e
comment
May 23, 2024
869de3f
cruft
May 23, 2024
e68e391
yapf
May 23, 2024
d77cf50
remove unquantized check
dsikka May 23, 2024
Merge branch 'main' into ds-quant
dsikka authored May 14, 2024
commit 43c43f3c494afc7b55919a1f83609fbb07d7e8eb
1 change: 1 addition & 0 deletions CMakeLists.txt
@@ -169,6 +169,7 @@ set(VLLM_EXT_SRC
"csrc/quantization/gptq/q_gemm.cu"
"csrc/quantization/compressed_tensors/int8_quant_kernels.cu"
Collaborator:
The name compressed_tensors is a bit unclear to me. I suppose this is the official method name from SparseML? If so, can we just use sparseml here?

Collaborator:
compressed-tensors is the name of the package responsible for saving quantized and sparse models.

So the flow is (the sketch after this list shows it end to end):

  • use SparseML to apply quantization / sparsity
  • save the model to safetensors with a compressed-tensors config
  • load + run in vLLM
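A minimal sketch of that last step from the vLLM side, assuming a hypothetical checkpoint name. At this commit the method is registered under the "sparseml" key (see the QUANTIZATION_METHODS diff below), so that is the value passed as quantization:

```python
from vllm import LLM, SamplingParams

# Hypothetical checkpoint produced by SparseML and saved in the
# compressed-tensors format; substitute a real model path.
llm = LLM(model="nm-testing/llama-w8a8-compressed-tensors",
          quantization="sparseml")

outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```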

"csrc/quantization/fp8/fp8_cuda_kernels.cu"
"csrc/quantization/fp8/common.cu"
"csrc/cuda_utils_kernels.cu"
"csrc/moe_align_block_size_kernels.cu"
"csrc/pybind.cpp")
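For context on the kernel this build list registers: the static, per-tensor-scaled int8 activation quantization implemented in int8_quant_kernels.cu can be written as a short PyTorch reference. This is a sketch of the math only, not the CUDA kernel:

```python
import torch

def static_scaled_int8_quant_ref(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Static quantization: the scale is precomputed offline, so at runtime
    # we only divide, round, clamp to the int8 range, and cast.
    q = torch.round(x / scale)
    return torch.clamp(q, -128, 127).to(torch.int8)

x = torch.randn(4, 8)
print(static_scaled_int8_quant_ref(x, scale=0.05))
```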
2 changes: 1 addition & 1 deletion requirements-cuda.txt
@@ -8,4 +8,4 @@ vllm-nccl-cu12>=2.18,<2.19 # for downloading nccl library
torch == 2.3.0
xformers == 0.0.26.post1 # Requires PyTorch 2.3.0
nvidia-cutlass == 3.5.0

vllm-flash-attn == 2.5.8.post1 # Requires PyTorch 2.3.0
3 changes: 3 additions & 0 deletions vllm/model_executor/layers/quantization/__init__.py
@@ -6,6 +6,8 @@
QuantizationConfig)
from vllm.model_executor.layers.quantization.compressed_tensors.compressed_tensors import ( # noqa: E501
CompressedTensorsConfig)
from vllm.model_executor.layers.quantization.deepspeedfp import (
DeepSpeedFPConfig)
from vllm.model_executor.layers.quantization.fp8 import Fp8Config
from vllm.model_executor.layers.quantization.gptq import GPTQConfig
from vllm.model_executor.layers.quantization.gptq_marlin import (
@@ -22,6 +24,7 @@
"gptq_marlin": GPTQMarlinConfig,
"marlin": MarlinConfig,
"sparseml": CompressedTensorsConfig
"deepspeedfp": DeepSpeedFPConfig
}


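The QUANTIZATION_METHODS mapping above is how vLLM resolves a method name to its config class. A simplified sketch of that lookup, reusing the dict and the QuantizationConfig import from the diff above (the real helper in this module may differ slightly at this commit):

```python
from typing import Type

def get_quantization_config(quantization: str) -> Type[QuantizationConfig]:
    # Resolve a method name ("sparseml", "deepspeedfp", ...) to its config class.
    if quantization not in QUANTIZATION_METHODS:
        raise ValueError(f"Invalid quantization method: {quantization}")
    return QUANTIZATION_METHODS[quantization]

config_cls = get_quantization_config("sparseml")  # -> CompressedTensorsConfig
```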
6 changes: 4 additions & 2 deletions vllm/model_executor/models/llama.py
@@ -66,13 +66,13 @@ def __init__(
layer_name=f"{parent_name}.gate_up_proj",
input_size=hidden_size,
output_sizes=[intermediate_size] * 2,
- bias=False,
+ bias=bias,
quant_config=quant_config)
self.down_proj = RowParallelLinear(
layer_name=f"{parent_name}.down_proj",
input_size=intermediate_size,
output_size=hidden_size,
- bias=False,
+ bias=bias,
quant_config=quant_config)
if hidden_act != "silu":
raise ValueError(f"Unsupported activation: {hidden_act}. "
@@ -285,8 +285,10 @@ def __init__(
self.layers = nn.ModuleList([
LlamaDecoderLayer(parent_name=f"model.layers.{idx}",
config=config,
+ cache_config=cache_config,
quant_config=quant_config)
for idx in range(config.num_hidden_layers)

])
self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)

You are viewing a condensed version of this merge commit. You can view the full changes here.