Updates torchao pin to enable shared embedding quantization #9548
Conversation
Dr. CI: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9548. As of commit 1d7cc21 with merge base 94ec549: 1 new failure.
Quoted from the docs updated in this PR:

> A few notes:
>
> - If your model shares embedding/unembedding weights (like Llama1B and Llama3B do), you can add `--use_shared_embedding` to take advantage of this and reduce memory. When this option is enabled, you can specify whether embeddings are quantized with weight zeros or not by specifying a third argument. For example, `-E "torchao:4,32,true"` means that the embedding is quantized to 4 bits with group_size=32 and uses weight zeros (this is the default behavior if you simply use `-E "torchao:4,32"`), whereas `-E "torchao:4,32,false"` means that the embedding is quantized to 4 bits with group_size=32 but with scales only (no weight zeros). If `--use_shared_embedding` is specified, the unembedding (i.e., the final linear layer) is quantized in the same way, but also uses 8-bit dynamically quantized activations.
Reviewer: Not for this PR, but what's the plan for updating our arg selection scheme for quant? `-E "torchao:4,32,true"` isn't user friendly.
Author: You'd never need to do that. `true` is the default (and existing behavior), so you could continue to use `-E "torchao:4,32"`.
Reviewer: I'd make it a bit more clear that shared embedding is only available with torchao kernels, i.e. `torchao:` quantization.
Author: It's under the torchao section of the docs.
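For readers following this thread, here is a minimal sketch of how a spec like `torchao:4,32,true` decomposes into bitwidth, group size, and the optional weight-zeros flag. This is an illustration only, not the actual ExecuTorch parser:

```python
# Illustrative only: how an embedding-quantization spec such as "torchao:4,32,true"
# breaks down. The real parsing lives in ExecuTorch's export code and may differ.
def parse_embedding_quantize(spec: str):
    assert spec.startswith("torchao:"), "shared embedding requires torchao: quantization"
    parts = spec[len("torchao:"):].split(",")
    bitwidth = int(parts[0])    # e.g. 4 -> 4-bit weights
    group_size = int(parts[1])  # e.g. 32 -> one scale (and optional zero) per 32 weights
    # The optional third field controls weight zeros; it defaults to true.
    use_weight_zeros = parts[2].strip().lower() == "true" if len(parts) > 2 else True
    return bitwidth, group_size, use_weight_zeros

print(parse_embedding_quantize("torchao:4,32"))        # (4, 32, True)
print(parse_embedding_quantize("torchao:4,32,false"))  # (4, 32, False)
```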
Quoted diff (the new validation check; the excerpt cuts off before the error is raised):

```python
if args.use_shared_embedding:
    if not (
        args.embedding_quantize is not None
        and args.embedding_quantize.startswith("torchao:")
```
Suggested change:

```python
if args.use_shared_embedding and (
    args.embedding_quantize is None
    or not args.embedding_quantize.startswith("torchao:")
```
Reviewer: nit: fold the nested conditionals into a single check before raising the error.
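For context, a minimal self-contained sketch of the flattened check with the error included. The flag names mirror the thread above; the exception type and message are assumptions, since the quoted diff cuts off before the raise:

```python
import argparse

# Sketch only: the actual exception type/message in the PR are not shown in the diff.
parser = argparse.ArgumentParser()
parser.add_argument("-E", "--embedding_quantize", default=None)
parser.add_argument("--use_shared_embedding", action="store_true")
args = parser.parse_args(["--use_shared_embedding"])  # example: missing -E "torchao:..."

if args.use_shared_embedding and (
    args.embedding_quantize is None
    or not args.embedding_quantize.startswith("torchao:")
):
    raise ValueError(
        'Shared embedding is only supported with torchao quantization, e.g. -E "torchao:4,32"'
    )
```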
Quoted diff (source transform ordering):

```python
transforms.append(inject_fast_hadamard_transform_native_for_spin_quant)
# ...
if args.embedding_quantize:
```
Reviewer: Why did we change the order of the source transforms?
Author: The shared embedding transform must be applied before linear quantization, so I changed the order to embedding first, linear second. I added a code comment to this effect as well.
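To illustrate the ordering constraint being described, here is a minimal self-contained sketch; the helper names (`quantize_embedding_transform`, `quantize_linear_transform`, `build_source_transforms`) and the example argument values are placeholders, not the exact ExecuTorch functions:

```python
from typing import Callable, List

def quantize_embedding_transform(model):
    # Placeholder: would apply (shared) embedding quantization to `model`.
    return model

def quantize_linear_transform(model):
    # Placeholder: would apply linear weight quantization to `model`.
    return model

def build_source_transforms(embedding_quantize, quantization_mode) -> List[Callable]:
    transforms: List[Callable] = []
    if embedding_quantize:
        # Shared embedding quantization must run before linear quantization,
        # so the embedding transform is appended first.
        transforms.append(quantize_embedding_transform)
    if quantization_mode:
        # Linear (weight) quantization runs after the embedding transform.
        transforms.append(quantize_linear_transform)
    return transforms

# Transforms are applied in order: embedding first, linear second.
model = object()
for transform in build_source_transforms("torchao:4,32", "8da4w"):
    model = transform(model)
```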