
Conversation

glistening commented Jul 17, 2025

It introduces the circle_attention op and adds tests which fuse attention from LlamaDecoderLayers.

TICO-DCO-1.0-Signed-off-by: Sanggyu Lee sg5.lee@samsung.com

It is another fuser. It does not use pattern matching; instead, it registers an operator using a decorator (see the sketch after the commands below).

To generate a TinyLlama circle, run:

python test/modules/model/LlamaDecoderLayerWithKVCache/layer.py  # 1 decoder-layer only

or

python test/modules/model/LlamaDecoderLayerWithKVCache/layers.py  # All decoder-layers
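For reference, here is a minimal sketch of what decorator-style op registration looks like. The op name circle::attention.llama matches what this PR registers later; the single-tensor schema and the helper names (lib, attention_llama) are simplified assumptions, since the real fused op takes the full set of attention inputs (weights, KV cache, positions, and so on).

import torch

# Sketch only: the real op schema in this PR is richer than a single tensor.
lib = torch.library.Library("circle", "DEF")
lib.define("attention.llama(Tensor hidden_states) -> Tensor")

def attention_llama(hidden_states: torch.Tensor) -> torch.Tensor:
    # Reference (eager) kernel; the real implementation computes the fused attention.
    return hidden_states.clone()

lib.impl("attention.llama", attention_llama, "CompositeExplicitAutograd")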

glistening changed the title from "[test+operators] Fuse attention to circle attention" to "[test+operators] Fuse llama attention to circle attention" on Jul 17, 2025
glistening force-pushed the attention branch 6 times, most recently from a255e2f to 2cf5d50, on July 18, 2025 04:45
glistening commented:

python test/modules/model/LlamaDecoderLayerWithKVCache/layer.py 


glistening commented Jul 21, 2025

python test/modules/model/LlamaDecoderLayerWithKVCache/layers.py

repeats the decoder layers with almost the same inputs (attention_mask, ...); a toy sketch follows.
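This toy stand-in (not the PR's code; shapes and layer internals are placeholders) illustrates the point: every layer is called in sequence with the same auxiliary input, while only hidden_states is threaded through.

import torch
import torch.nn as nn

class ToyDecoderStack(nn.Module):
    def __init__(self, num_layers: int = 4, dim: int = 16):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor):
        for layer in self.layers:
            # The same attention_mask is reused by every layer.
            hidden_states = layer(hidden_states) * attention_mask
        return hidden_states

stack = ToyDecoderStack()
out = stack(torch.randn(1, 8, 16), torch.ones(1, 8, 1))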


glistening force-pushed the attention branch 4 times, most recently from 00596a2 to 4ef9092, on July 23, 2025 02:58
Comment on lines 27 to 58
from tico.utils.record_input import RecordingInput

# past_key_values
# ---------------
# During prefill, "past_key_values" is not None but an empty Cache instance.
# Passing None makes torch.export happy.

# attention_mask, cache_position
# ------------------------------
# For the NPU, ignore the captured values generated from the example prompt.

input_to_remove = ["past_key_values", "attention_mask", "cache_position"]

with torch.no_grad(), RecordingInput(model, input_to_remove=input_to_remove) as rec:

This is the only change needed to capture the inputs.

do_sample=False,
pad_token_id=tokenizer.eos_token_id,
)
captured_input = rec.captured_input

Retrieve captured_input


model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
circle_model = tico.convert(model, captured_input)
glistening commented Jul 23, 2025


Then, pass captured_input to tico.convert.
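Putting the hunks above together, the capture-and-convert flow looks roughly like this. The model id, the prompt, and any generate() arguments beyond those visible in the diff are placeholders/assumptions.

import torch
import tico
from transformers import AutoModelForCausalLM, AutoTokenizer
from tico.utils.record_input import RecordingInput

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Inputs that should not appear in the exported graph.
input_to_remove = ["past_key_values", "attention_mask", "cache_position"]

with torch.no_grad(), RecordingInput(model, input_to_remove=input_to_remove) as rec:
    inputs = tokenizer("Hello", return_tensors="pt")  # placeholder prompt
    model.generate(
        **inputs,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
captured_input = rec.captured_input

# Convert a fresh model instance with the captured example inputs.
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
circle_model = tico.convert(model, captured_input)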


glistening commented Jul 28, 2025

7/28 Offline Code Review

  • The assumption that k_cache is one contiguous tensor formed by concatenating k_cache_0, k_cache_1, ... is not valid for a quantized model when each layer's k_cache has a different qparam (same for v_cache); see the sketch below.
    • I will keep this structure since it is an f-circle for the CPU, not a q-circle for the NPU.
    • If I used a q-circle for the CPU, it would not be a q8 circle but block quantization like ggml_q4.
  • remove_unused_input may have side effects if the ExportedProgram uses the unused placeholder. (What side effects? No one knows, but it is not a required pass; I just wanted to make the graph tidy.) I will not push the remove_unused_input pass.
  • op_circle_attention seems onert-specific. It will be put under serialize/operators/onert as op_attention.
  • It is okay to add record_input.py since it does not break anything for code that does not use it.
  • The new test case which uses record_input.py will use a function unittest instead of a module unittest (which requires get_example_inputs).
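A numerical illustration (made-up values) of the first point: with one shared scale for a contiguous int8 buffer, the layer with the smaller value range loses precision compared to a per-layer scale.

import torch

k_cache_0 = torch.tensor([0.02, -0.05, 0.07])  # layer 0: small value range
k_cache_1 = torch.tensor([1.80, -2.30, 3.10])  # layer 1: large value range

def int8_roundtrip_error(x: torch.Tensor, scale: float) -> float:
    # Symmetric int8 quantize -> dequantize, report max abs error.
    q = torch.clamp(torch.round(x / scale), -128, 127)
    return (q * scale - x).abs().max().item()

per_layer_scale = k_cache_0.abs().max().item() / 127
shared_scale = torch.cat([k_cache_0, k_cache_1]).abs().max().item() / 127

print("layer-0 error with its own scale:", int8_roundtrip_error(k_cache_0, per_layer_scale))
print("layer-0 error with shared scale: ", int8_roundtrip_error(k_cache_0, shared_scale))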


glistening commented Jul 28, 2025

I took a look at ExportedProgram in the PyTorch source. If we are not certain about removing unused inputs at the torch IR level, it would be better to do it in circle2circle in ONE or another circle-modification tool (cc @llFreetimell). Circle is preferred from several perspectives.

glistening force-pushed the attention branch 4 times, most recently from 5a35218 to 2501221, on July 31, 2025 00:51
return hidden_states


@torch.library.register_fake("circle::attention.llama")

Without this, export fails with:

torch._dynamo.exc.Unsupported: Operator does not support running with fake tensors

  Developer debug context: unsupported operator: circle.attention.llama
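A minimal, self-contained sketch of the fake (meta) kernel that avoids this error. The single-tensor schema mirrors the simplified sketch earlier in this thread (repeated here for self-containment); the real op takes the full fused-attention inputs.

import torch

lib = torch.library.Library("circle", "DEF")
lib.define("attention.llama(Tensor hidden_states) -> Tensor")

@torch.library.register_fake("circle::attention.llama")
def attention_llama_fake(hidden_states: torch.Tensor) -> torch.Tensor:
    # No real computation: only shape/dtype propagation on FakeTensors,
    # which is what torch.export / dynamo needs during tracing.
    return torch.empty_like(hidden_states)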


glistening commented Aug 13, 2025

  • Added decode.py for decoding with the whole model.
  • Removed attention_mask from the fused op_attention because:
    • I am thinking of running the decode phase on the CPU, not the NPU, so I assume there is no pad token to the left or right of the actual tokens.
  • No longer assume kv_cache.key_cache is one big chunk. Each layer has its own k and v cache tensors (sketched below).
  • Removed the "CPU" dummy implementation in op_attention.py. @seockho-kim informed me it works without it while working on the RMSNorm op.
  • No Optional for op_attention. (attention_mask is eliminated, and the Cache is mandatory since I decided to focus on {decoding on cpu, prefill on npu}.)
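A hypothetical sketch of the per-layer cache layout (shapes are placeholders, not the PR's values):

import torch

num_layers, batch, num_heads, max_seq, head_dim = 2, 1, 4, 128, 64

# Each decoder layer owns its own key/value cache tensor instead of slicing
# one big concatenated buffer, so each can later carry its own qparams.
k_caches = [torch.zeros(batch, num_heads, max_seq, head_dim) for _ in range(num_layers)]
v_caches = [torch.zeros(batch, num_heads, max_seq, head_dim) for _ in range(num_layers)]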

Sanggyu Lee added 19 commits November 4, 2025 14:09
It introduces the circle_attention op and adds tests which fuse attention
from LlamaDecoderLayers.

TICO-DCO-1.0-Signed-off-by: Sanggyu Lee <sg5.lee@samsung.com>
model.py for LlamaModel
layer.py for LlamaDecoderLayer
The library.impl for "CPU" turned out to be unnecessary.
The causal attention_mask can be calculated in op_attention. However, op_attention is not the proper place, because the cos and sin tables would have to be recalculated in every attention layer even though they are sharable across the decoder layers.
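For context, a sketch (the standard RoPE formulation, not this PR's code) of why the cos/sin tables are sharable: they depend only on position and head_dim, so they can be computed once and reused by every decoder layer.

import torch

def rope_tables(seq_len: int, head_dim: int, base: float = 10000.0):
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    pos = torch.arange(seq_len).float()
    freqs = torch.outer(pos, inv_freq)       # (seq_len, head_dim/2)
    emb = torch.cat((freqs, freqs), dim=-1)  # (seq_len, head_dim)
    return emb.cos(), emb.sin()

cos, sin = rope_tables(seq_len=128, head_dim=64)  # computed once
# ...and then shared by every decoder layer's attention.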
glistening force-pushed the attention branch 2 times, most recently from ae241d2 to 7ed0d2a, on November 4, 2025 05:12

#400

glistening closed this on Nov 26, 2025