[test+operators] Fuse llama attention to circle attention #217
Conversation
Force-pushed from a255e2f to 2cf5d50
Force-pushed from 00596a2 to 4ef9092
```python
from tico.utils.record_input import RecordingInput

# past_key_values
# ---------------
# During prefill, "past_key_values" is not None but an empty Cache instance.
# Passing None makes torch.export happy.

# attention_mask, cache_position
# ------------------------------
# For the NPU, ignore captured values generated from the example prompt.

input_to_remove = ["past_key_values", "attention_mask", "cache_position"]

with torch.no_grad(), RecordingInput(model, input_to_remove=input_to_remove) as rec:
```
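The internals of `RecordingInput` are not shown in this diff. As a rough, purely illustrative sketch (class and attribute names here are assumptions, not the actual `tico` implementation), the idea is a context manager that wraps the model's forward, records the keyword arguments of the first call, and drops the keys listed in `input_to_remove`:

```python
# Hypothetical sketch of a RecordingInput-style context manager.
# DummyModel stands in for a HuggingFace model; only forward() matters here.

class DummyModel:
    def forward(self, **kwargs):
        return "logits"

class RecordingInputSketch:
    def __init__(self, model, input_to_remove=()):
        self.model = model
        self.input_to_remove = set(input_to_remove)
        self.captured_input = None
        self._orig_forward = None

    def __enter__(self):
        self._orig_forward = self.model.forward

        def recording_forward(**kwargs):
            if self.captured_input is None:  # keep only the first (prefill) call
                self.captured_input = {
                    k: v for k, v in kwargs.items()
                    if k not in self.input_to_remove
                }
            return self._orig_forward(**kwargs)

        self.model.forward = recording_forward
        return self

    def __exit__(self, *exc):
        self.model.forward = self._orig_forward  # restore the original forward
        return False

model = DummyModel()
to_remove = ["past_key_values", "attention_mask", "cache_position"]
with RecordingInputSketch(model, input_to_remove=to_remove) as rec:
    model.forward(input_ids=[1, 2, 3], attention_mask=[1, 1, 1],
                  past_key_values=object(), cache_position=[0, 1, 2])
```

After the `with` block, `rec.captured_input` holds only the inputs that were not filtered out.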
This is the only change needed to capture the input.
```python
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
captured_input = rec.captured_input
```
Retrieve `captured_input`.
```python
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
circle_model = tico.convert(model, captured_input)
```
Then, pass `captured_input` to `tico.convert`.
7/28 Offline Code Review
I took a look into
Force-pushed from 5a35218 to 2501221
```python
    return hidden_states


@torch.library.register_fake("circle::attention.llama")
```
Without this:

```
torch._dynamo.exc.Unsupported: Operator does not support running with fake tensors
Developer debug context: unsupported operator: circle.attention.llama
```
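The reason a fake ("meta") kernel is needed: export-style tracing runs every op on fake tensors that carry only shape and dtype metadata, so each custom op must supply a shape-propagation rule in addition to its real kernel. The following is a pure-Python analogy of that dispatch scheme, not the actual `torch.library` API; all names here are illustrative:

```python
# Two registries: real kernels compute values, fake kernels only propagate
# shapes. Tracing with FakeTensor inputs dispatches to the fake table.

REAL_KERNELS = {}
FAKE_KERNELS = {}

def register_real(name):
    def deco(fn):
        REAL_KERNELS[name] = fn
        return fn
    return deco

def register_fake(name):
    def deco(fn):
        FAKE_KERNELS[name] = fn
        return fn
    return deco

class FakeTensor:
    """Carries metadata only; no data, no computation."""
    def __init__(self, shape):
        self.shape = shape

@register_real("circle::attention.llama")
def attention_real(hidden_states):
    # real math would happen here (placeholder: double every element)
    return [[x * 2 for x in row] for row in hidden_states]

@register_fake("circle::attention.llama")
def attention_fake(hidden_states):
    # shape in == shape out for this op; no actual computation
    return FakeTensor(hidden_states.shape)

def call_op(name, arg):
    table = FAKE_KERNELS if isinstance(arg, FakeTensor) else REAL_KERNELS
    if name not in table:
        # this is the situation the dynamo error above corresponds to
        raise RuntimeError(f"unsupported operator: {name}")
    return table[name](arg)

out = call_op("circle::attention.llama", FakeTensor((1, 4, 64)))
```

Without an entry in the fake table, tracing with a `FakeTensor` input would hit the `unsupported operator` error, which mirrors what dynamo reports.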
Force-pushed from c1a9ca9 to 375c62f
It introduces the circle_attention op and adds tests which fuse attention from LlamaDecoderLayers. TICO-DCO-1.0-Signed-off-by: Sanggyu Lee <sg5.lee@samsung.com>
- `model.py` for LlamaModel
- `layer.py` for LlamaDecoderLayer
`library.impl` for `"CPU"` turned out to be unnecessary.
The causal attention_mask could be calculated inside op_attention. However, op_attention is not the proper place, because the cos and sin tables would then be recalculated in every attention layer, even though they are sharable across decoder layers.
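The sharing argument above can be sketched as follows: build the rotary-style cos/sin tables once at the model level and pass them into each layer, instead of rebuilding them inside every attention op. This is a minimal illustration with hypothetical names, not TICO code:

```python
import math

COS_SIN_CALLS = 0  # counts how many times the tables are actually built

def build_cos_sin(seq_len, head_dim, base=10000.0):
    """Rotary-style cos/sin tables, computed with stdlib math only."""
    global COS_SIN_CALLS
    COS_SIN_CALLS += 1
    inv_freq = [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]
    cos = [[math.cos(p * f) for f in inv_freq] for p in range(seq_len)]
    sin = [[math.sin(p * f) for f in inv_freq] for p in range(seq_len)]
    return cos, sin

def attention_layer(cos, sin):
    # ...each layer consumes the shared tables instead of rebuilding them
    assert len(cos) == len(sin)

def run_decoder_stack(num_layers, seq_len, head_dim):
    # compute once at the model level, then hand the tables to every layer
    cos, sin = build_cos_sin(seq_len, head_dim)
    for _ in range(num_layers):
        attention_layer(cos, sin)

run_decoder_stack(num_layers=22, seq_len=16, head_dim=8)
```

Placing `build_cos_sin` inside the per-layer attention op would multiply the table cost by the number of decoder layers for no benefit.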
Force-pushed from ae241d2 to 7ed0d2a


This is another kind of fuser: it does not use pattern matching, but instead registers an operator using a decorator.
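To make the contrast concrete, a decorator-based fuser can tag a module class so that its whole forward is routed through a single fused op, rather than searching the traced graph for subgraph patterns afterwards. A toy sketch under that assumption (the decorator name, registry, and module are all hypothetical):

```python
# Sketch of a decorator-based fuser: the decorated class's entire forward
# becomes one fused-op call, with no post-hoc graph pattern matching.

FUSED_OPS = {}

def fuse_as(op_name):
    """Class decorator: route the decorated module's forward to a fused op."""
    def deco(cls):
        original_forward = cls.forward
        FUSED_OPS[op_name] = original_forward  # the op's "kernel"

        def fused_forward(self, *args, **kwargs):
            # In the real flow this would emit a single circle::... op node;
            # here we just dispatch through the registry to show the shape.
            return FUSED_OPS[op_name](self, *args, **kwargs)

        cls.forward = fused_forward
        return cls
    return deco

@fuse_as("circle::attention.llama")
class TinyAttention:
    def forward(self, hidden_states):
        return hidden_states  # placeholder body

layer = TinyAttention()
out = layer.forward([1.0, 2.0])
```

The upside of this style is that fusion boundaries are declared where the module is defined, so they cannot be missed by a fragile graph pattern.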
To generate the tinyllama circle, run:

```
python test/modules/model/LlamaDecoderLayerWithKVCache/layer.py   # 1 decoder layer only
python test/modules/model/LlamaDecoderLayerWithKVCache/layers.py  # all decoder layers
```