When trying to quantize a model, the following exception is raised:
TorchRuntimeError: Failed running call_function <built-in function linear>(*(FakeTensor(..., device='cuda:0', size=(2, 32)), LinearActivationQuantizedTensor(AffineQuantizedTensor(layout_tensor=Float8AQTLayout(
float8_data=FakeTensor(..., device='cuda:0', size=(15, 32), dtype=torch.float8_e4m3fn),
scale=FakeTensor(..., device='cuda:0', size=()),
transposed=False, layout_type=Float8LayoutType(mm_config=Float8MMConfig(emulate=False, use_fast_accum=True, pad_inner_dim=False))), block_size=torch.Size([15, 32]), shape=torch.Size([15, 32]), device=cuda:0, dtype=torch.float32, requires_grad=False), functools.partial(<function _input_activation_quant_func_fp8 at 0x7a94b4f4d120>, activation_granularity=PerTensor(), activation_dtype=torch.float8_e4m3fn)), Parameter(FakeTensor(..., device='cuda:0', size=(15,), requires_grad=True))), **{}):
Expected both dimensions of mat2 to be divisble by 16 but got torch.Size([32, 15])
Minimal code to reproduce the issue:
import torch
from torchao.quantization import (
    float8_dynamic_activation_float8_weight,
    quantize_,
)

dim1 = 32
dim2 = 15  # out_features not divisible by 16

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Linear(dim1, dim2)

    def forward(self, x):
        return self.model(x)

model = ToyModel().to("cuda").eval()
quantize_(model, float8_dynamic_activation_float8_weight())
model = torch.compile(model=model, fullgraph=True, mode="max-autotune")
model(torch.randn(2, 32).to("cuda"))
Is this by design or is it a bug? Currently this prevents many models from being quantized.
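
For context, the alignment check comes from the FP8 matmul path: cuBLASLt requires both dimensions of the second operand to be multiples of 16. As a possible workaround (a sketch, not a fix), quantize_ accepts a filter_fn, so layers with incompatible shapes can be left in high precision. The divisible_by_16 helper below is hypothetical, written only for this illustration:

import torch
from torchao.quantization import (
    float8_dynamic_activation_float8_weight,
    quantize_,
)

def divisible_by_16(module: torch.nn.Module, fqn: str) -> bool:
    # Hypothetical helper: only quantize Linear layers whose in/out
    # features satisfy the 16-alignment requirement of the FP8 kernel.
    return (
        isinstance(module, torch.nn.Linear)
        and module.in_features % 16 == 0
        and module.out_features % 16 == 0
    )

# Layers rejected by the filter stay in high precision instead of crashing.
quantize_(model, float8_dynamic_activation_float8_weight(), filter_fn=divisible_by_16)

This avoids the crash at the cost of leaving those layers unquantized; the pad_inner_dim flag visible in the error's Float8MMConfig appears to only pad the contraction dimension, so it would not cover the out_features=15 case here.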