Description
When I replace the output layer for llama3.1 70B
`nn.Linear(8192, 128_256, bias=False)`
with `FrozenNF4Linear(8192, 128_256, bias=False)`
in torchtune, I surprisingly end up using a lot more memory. Leaving the output layer in bf16 results in the training run using ~43 GB of peak active memory, while quantizing the output layer results in ~52 GB of peak active memory. I wonder if this is due to the large size of the output layer.
Steps to reproduce:
- Replace `nn.Linear` with `FrozenNF4Linear` in the model here (`FrozenNF4Linear` is just a `linear_nf4` wrapper); a sketch of the swap follows the command below
- tune config here
- command:
  `tune run lora_finetune_single_device --config ./70B_qlora_long_context.yaml tokenizer.max_seq_len=8192`
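For the first step, a minimal sketch of the swap, assuming `FrozenNF4Linear` is imported from `torchtune.modules.low_precision` (the helper name is illustrative; the actual change goes in the linked model builder):

```python
import torch
import torch.nn as nn
from torchtune.modules.low_precision import FrozenNF4Linear  # assumed import path

def build_output_proj(quantize_output: bool) -> nn.Module:
    """Illustrative helper: the llama3.1 70B output projection, optionally NF4-quantized."""
    if quantize_output:
        # Frozen (non-trainable) linear whose weight is stored in NF4; the forward
        # pass goes through torchao's linear_nf4, per the description above.
        return FrozenNF4Linear(8192, 128_256, bias=False)
    # Baseline: keep the output projection in bf16.
    return nn.Linear(8192, 128_256, bias=False, dtype=torch.bfloat16)
```

Toggling between the two branches is what produces the ~43 GB vs ~52 GB peak active memory difference reported above.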