NF4 quantization slower on 0.3 vs 0.1

Hi, we're observing a slowdown in our torchtune QLoRA recipe initialization after upgrading torchao from 0.1 to 0.3 (I haven't checked 0.4 yet but will do so shortly). This was first pointed out in https://github.com/pytorch/torchtune/issues/1246, and I believe the cause is some change in torchao.

Repro: from a torchtune git install

```
# Just some commit hash from right before we upgraded to 0.3
git checkout 52e328337579e9b84ba7f2448b29a6de7c5d8db3
pip install torchao==0.1

# Save time.perf_counter() on init and then log the delta with perf_counter()
# here: https://github.com/pytorch/torchtune/blob/0a407712eda252573326074d33af0a66c2d2990e/recipes/lora_finetune_single_device.py#L539
tune run lora_finetune_single_device --config llama3/8B_qlora_single_device
>>> 15.1960636760341

# Do the same on 0.3
pip install torchao==0.3
# also need to comment some quant APIs out to fix import errors
tune run lora_finetune_single_device --config llama3/8B_qlora_single_device
>>> 95.78260190901347
```
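For anyone reproducing this, the timing described in the comments above can be sketched roughly as follows. This is only an illustration of the `time.perf_counter()` delta approach, not the actual recipe code: the `timed` helper and `fake_setup` are hypothetical names, and `fake_setup` merely stands in for the recipe's initialization step.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Store the delta even if the block raises.
        results[label] = time.perf_counter() - start


def fake_setup():
    # Hypothetical stand-in for the recipe's setup, where the NF4
    # quantization of the base model weights would happen.
    time.sleep(0.01)


timings = {}
with timed("init", timings):
    fake_setup()

print(f"init took {timings['init']:.4f}s")
```

In the real recipe, the start timestamp would be saved at the beginning of init and the delta logged at the line linked above.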
