
Conversation

Contributor

@Echo-Nie Echo-Nie commented Sep 15, 2025

add test_weight_only.py

Input shape: [2, 16]
Weight int8 shape: [16, 16]
Weight scale shape: [16]
Bias shape: [16]
Int32 accum stats: min=-430.19, max=523.19, mean=63.72
Scaled stats (after weight_scale): min=-7.14, max=9.53, mean=1.14
Final output (after bias) stats: min=-8.45, max=8.66, mean=1.37


=== Reference float output ===
[[ 2.6108642 -0.6497458  7.0613937  2.0976558  8.218029   4.234454
   2.2585444  0.29855    2.5939693 -3.258247   4.2500834  1.8264745
   2.0854921 -3.9599     8.660601   2.673855 ]
 [ 3.8692675  2.6785803 -2.6663718  5.2733607 -8.454243   1.9710996
  -2.3115897  3.712538  -0.7427913 -0.841658  -0.1247133  3.0144439
  -2.3901563  5.688132  -1.8924255 -3.824362 ]]
=== Quantized output ===
[[ 2.611  -0.65    7.062   2.098   8.22    4.234   2.258   0.2986  2.594
  -3.258   4.25    1.826   2.086  -3.959   8.664   2.674 ]
 [ 3.87    2.678  -2.666   5.273  -8.45    1.971  -2.312   3.713  -0.7427
  -0.842  -0.1247  3.014  -2.39    5.688  -1.893  -3.824 ]]
Max abs diff: 0.0034618378 , Mean abs diff: 0.00049396604
..
----------------------------------------------------------------------
Ran 4 tests in 0.127s

OK
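
For context, here is a minimal NumPy sketch of the per-channel weight-only int8 path that the stats above describe. The shapes follow the log ([2, 16] activations, [16, 16] weight, [16] scale and bias); the data, seed, and variable names are illustrative only and not taken from the test.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 16)).astype("float32")      # activations stay float
w = rng.standard_normal((16, 16)).astype("float32")     # [out_features, in_features]
bias = rng.standard_normal(16).astype("float32")

# Per-output-channel scale from the max absolute weight value
max_abs = np.max(np.abs(w), axis=1, keepdims=True) + 1e-6
scale = (max_abs / 127.0).astype("float32")              # [16, 1]
w_int8 = np.clip(np.round(w / scale), -128, 127).astype("int8")

# Accumulate against int8 weights, then undo the scale and add bias
accum = x @ w_int8.astype("float32").T                   # "int32 accum" stage in the log
out = accum * scale.reshape(1, -1) + bias                # scaled + bias stages

ref = x @ w.T + bias                                      # float reference
print("max abs diff:", np.max(np.abs(out - ref)))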


paddle-bot bot commented Sep 15, 2025

Thanks for your contribution!

Comment on lines +90 to +110
def test_apply_numerical_precision(self):
    """Test numerical precision of quantized output"""
    x = paddle.to_tensor(np.random.randn(2, self.in_features).astype("float16"))

    # Reference FP32 output
    ref_out = paddle.matmul(
        x.astype("float32"),
        (self.layer.weight.astype("float32") * self.layer.weight_scale.astype("float32")).transpose([1, 0]),
    )
    if self.layer.bias is not None:
        ref_out += self.layer.bias.astype("float32")

    # Manual quantized output
    weight_f32 = self.layer.weight.astype("float32")
    x_f32 = x.astype("float32")
    quant_out = paddle.matmul(x_f32, weight_f32 * self.layer.weight_scale.astype("float32"), transpose_y=True)
    if self.layer.bias is not None:
        quant_out += self.layer.bias.astype("float32")
    quant_out = quant_out.astype("float16")

    np.testing.assert_allclose(ref_out.numpy(), quant_out.numpy(), rtol=1e-2, atol=1e-2)
Collaborator


It looks like this does not actually check the result of self.method.apply.
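
One way this could be addressed, as a sketch only: route the quantized side through the method under test instead of the manual matmul. This assumes the test keeps a self.method handle and that apply(layer, x) returns the weight-only linear output; the exact signature should be confirmed against weight_only.py.

quant_out = self.method.apply(self.layer, x)
np.testing.assert_allclose(
    ref_out.numpy(), quant_out.astype("float32").numpy(), rtol=1e-2, atol=1e-2
)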

Comment on lines +20 to +21
from fastdeploy.model_executor.layers.quantization.weight_only import (
    GPUWeightOnlyLinearMethod,
Collaborator


Besides GPUWeightOnlyLinearMethod, weight_only.py also provides MacheteWeightOnlyLinearMethod, which uses a different kernel implementation. If you have an H-series GPU environment, please also add a case for MacheteWeightOnlyLinearMethod.
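
A sketch of how the same test could be parameterized over both kernels. The compute-capability gate below reflects the assumption from this comment that the Machete kernel needs Hopper-class (SM90+) hardware; both class names come from weight_only.py.

import paddle

from fastdeploy.model_executor.layers.quantization.weight_only import (
    GPUWeightOnlyLinearMethod,
    MacheteWeightOnlyLinearMethod,
)

METHOD_CLASSES = [GPUWeightOnlyLinearMethod]
if paddle.device.is_compiled_with_cuda() and paddle.device.cuda.get_device_properties().major >= 9:
    # Only exercise the Machete kernel on SM90+ (Hopper) GPUs.
    METHOD_CLASSES.append(MacheteWeightOnlyLinearMethod)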

self.fd_config.load_config = type("load_config", (), {"load_choices": "default_v1"})()

# float32 weights
weight_fp32 = np.random.randn(*self.weight_shape).astype("float32")
Collaborator


Weights are generally fp16 or bf16.
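
A small sketch of generating the test weights in those dtypes. NumPy has no native bfloat16, so the bf16 case is built as a Paddle tensor and cast; self.weight_shape is the existing fixture attribute from the snippet above.

# fp16 weights via NumPy
weight_fp16 = np.random.randn(*self.weight_shape).astype("float16")
# bf16 weights via Paddle (NumPy cannot represent bfloat16 directly)
weight_bf16 = paddle.randn(list(self.weight_shape)).astype("bfloat16")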

Comment on lines +44 to +49
# Per-channel scale
max_abs = np.max(np.abs(weight_fp32), axis=1, keepdims=True) + 1e-6
scale = (max_abs / 127.0).astype("float32")

# Int8 quantization
weight_int8 = np.clip(np.round(weight_fp32 / scale), -128, 127).astype("int8")
Collaborator


It is recommended to call process_loaded_weights from GPUWeightOnlyLinearMethod here to quantize the weights.
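
As a sketch, letting the quantization method do the weight quantization rather than reimplementing it in the test. This assumes process_loaded_weights(layer, weight) populates layer.weight (int8) and layer.weight_scale in place; confirm the signature in weight_only.py.

# Feed the float weights through the method under test instead of quantizing manually
weight = paddle.to_tensor(weight_fp32).astype("float16")
self.method.process_loaded_weights(self.layer, weight)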

return paddle.create_parameter(shape=shape, dtype=dtype, default_initializer=default_initializer)


class TestGPUWeightOnlyLinearMethod(unittest.TestCase):
Collaborator


Besides the check on the apply method, it would also be good to add checks for the process_prequanted_weights and process_loaded_weights methods.
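
A sketch of one such added check. The weight orientation, the self.out_features fixture attribute, and the assumption that process_loaded_weights(layer, weight) fills layer.weight and layer.weight_scale in place should all be verified against weight_only.py; process_prequanted_weights would be exercised the same way with prequantized inputs.

def test_process_loaded_weights(self):
    # Quantize a fresh fp16 weight through the method and check the produced buffers
    weight = paddle.randn([self.in_features, self.out_features]).astype("float16")
    self.method.process_loaded_weights(self.layer, weight)
    self.assertEqual(self.layer.weight.dtype, paddle.int8)
    self.assertEqual(tuple(self.layer.weight_scale.shape), (self.out_features,))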
