fp8dq requires both dimensions to be divisible by 16 #1268

Closed
Description

@piotr-bazan-nv

When trying to quantize a model, the following exception is raised:

TorchRuntimeError: Failed running call_function <built-in function linear>(*(FakeTensor(..., device='cuda:0', size=(2, 32)), LinearActivationQuantizedTensor(AffineQuantizedTensor(layout_tensor=Float8AQTLayout(
float8_data=FakeTensor(..., device='cuda:0', size=(15, 32), dtype=torch.float8_e4m3fn),
scale=FakeTensor(..., device='cuda:0', size=()),
transposed=False, layout_type=Float8LayoutType(mm_config=Float8MMConfig(emulate=False, use_fast_accum=True, pad_inner_dim=False))), block_size=torch.Size([15, 32]), shape=torch.Size([15, 32]), device=cuda:0, dtype=torch.float32, requires_grad=False), functools.partial(<function _input_activation_quant_func_fp8 at 0x7a94b4f4d120>, activation_granularity=PerTensor(), activation_dtype=torch.float8_e4m3fn)), Parameter(FakeTensor(..., device='cuda:0', size=(15,), requires_grad=True))), **{}):
Expected both dimensions of mat2 to be divisble by 16 but got torch.Size([32, 15])

Minimal code to reproduce the issue:

import torch
from torchao.quantization import (
    float8_dynamic_activation_float8_weight,
    quantize_,
)
dim1 = 32
dim2 = 15  # not divisible by 16, which triggers the error above

class ToyModel(torch.nn.Module):

    def __init__(self):
        super().__init__()
        self.model = torch.nn.Linear(dim1, dim2)

    def forward(self, x):
        return self.model(x)

model = ToyModel().to("cuda").eval()

quantize_(model, float8_dynamic_activation_float8_weight())
model = torch.compile(model=model, fullgraph=True, mode="max-autotune")
model(torch.randn(2, 32).to('cuda'))

Is this by design, or is it a bug? Currently this prevents many models from being quantized.
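As a possible workaround (a sketch, not an official fix), one could pass a `filter_fn` to `quantize_` so that only linear layers whose dimensions are both divisible by 16 get quantized, leaving the remaining layers in their original precision. The divisibility check below is an assumption based on the error message above, and `fp8_compatible` is a hypothetical helper name; it reuses the `ToyModel` from the reproduction snippet.

import torch
from torchao.quantization import (
    float8_dynamic_activation_float8_weight,
    quantize_,
)

def fp8_compatible(module: torch.nn.Module, fqn: str) -> bool:
    # Only quantize Linear layers whose in/out features are multiples of 16,
    # matching the constraint reported by the fp8 matmul kernel.
    return (
        isinstance(module, torch.nn.Linear)
        and module.in_features % 16 == 0
        and module.out_features % 16 == 0
    )

model = ToyModel().to("cuda").eval()
quantize_(model, float8_dynamic_activation_float8_weight(), filter_fn=fp8_compatible)

With this filter, the 32x15 layer in the toy model would simply be skipped rather than raising, which may or may not be acceptable depending on how much of a real model falls outside the divisibility constraint.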
