
Commit 6a42e09

elfiegg authored and mgoin committed
[Core] Default to using per_token quantization for fp8 when cutlass is supported. (vllm-project#8651)
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
1 parent 96ce0fb commit 6a42e09

File tree
  • vllm/model_executor/layers/quantization

1 file changed: 2 additions, 1 deletion


vllm/model_executor/layers/quantization/fp8.py
Lines changed: 2 additions & 1 deletion

@@ -355,7 +355,8 @@ def apply(self,
             input_scale=layer.input_scale,
             bias=bias,
             cutlass_fp8_supported=self.cutlass_fp8_supported,
-            use_per_token_if_dynamic=False)
+            # Default to using per_token quantization if cutlass is supported
+            use_per_token_if_dynamic=self.cutlass_fp8_supported)


 class Fp8MoEMethod(FusedMoEMethodBase):
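
For context: with dynamic fp8 quantization, per-token scaling computes one scale per row (token) of the activation rather than a single scale for the whole tensor, which generally preserves more precision. The sketch below is illustrative only; the helper name `dynamic_fp8_scale` and the `FP8_MAX` constant are assumptions for this example, not vLLM's `apply_fp8_linear` implementation.

```python
# Illustrative sketch of per-tensor vs per-token dynamic fp8 scaling.
# Not vLLM's implementation; `dynamic_fp8_scale` and FP8_MAX are assumed names.
import torch

FP8_MAX = 448.0  # max finite value of float8_e4m3fn


def dynamic_fp8_scale(x: torch.Tensor, per_token: bool) -> torch.Tensor:
    """Return a dequantization scale so that x / scale fits in the fp8 range."""
    if per_token:
        # One scale per row (token): shape [num_tokens, 1].
        amax = x.abs().amax(dim=-1, keepdim=True)
    else:
        # A single scale shared by the whole tensor (0-dim).
        amax = x.abs().amax()
    return amax.float().clamp(min=1e-12) / FP8_MAX


x = torch.randn(4, 8)
per_tensor_scale = dynamic_fp8_scale(x, per_token=False)  # scalar
per_token_scale = dynamic_fp8_scale(x, per_token=True)    # shape [4, 1]
x_fp8 = (x / per_token_scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
```

Tying the flag to `self.cutlass_fp8_supported` presumably reflects that the CUTLASS fp8 GEMM path can consume per-token activation scales, while the fallback path is limited to per-tensor scales.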
