
Conversation

glistening commented Jul 17, 2025

It introduces the circle_attention op and adds tests which fuse attention from LlamaDecoderLayers.

TICO-DCO-1.0-Signed-off-by: Sanggyu Lee sg5.lee@samsung.com

It is another fuser. It does not use pattern matching; instead, it registers an operator using a decorator (see the sketch after the commands below).

To generate a TinyLlama circle, run:

python test/modules/model/LlamaDecoderLayerWithKVCache/layer.py  # 1 decoder-layer only

or

python test/modules/model/LlamaDecoderLayerWithKVCache/layers.py  # All decoder-layers
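For reference, here is a minimal sketch of what decorator-style op registration looks like. The op name circle::attention.llama matches what this PR registers later; the single-tensor schema and the helper names (lib, attention_llama) are simplified assumptions, since the real fused op takes the full set of attention inputs (weights, KV cache, positions, and so on).

import torch

# Sketch only: the real op schema in this PR is richer than a single tensor.
lib = torch.library.Library("circle", "DEF")
lib.define("attention.llama(Tensor hidden_states) -> Tensor")

def attention_llama(hidden_states: torch.Tensor) -> torch.Tensor:
    # Reference (eager) kernel; the real implementation computes the fused attention.
    return hidden_states.clone()

lib.impl("attention.llama", attention_llama, "CompositeExplicitAutograd")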

glistening changed the title from "[test+operators] Fuse attention to circle attention" to "[test+operators] Fuse llama attention to circle attention" on Jul 17, 2025
glistening force-pushed the attention branch 6 times, most recently from a255e2f to 2cf5d50, on July 18, 2025 04:45
glistening commented:

python test/modules/model/LlamaDecoderLayerWithKVCache/layer.py 


glistening commented Jul 21, 2025

python test/modules/model/LlamaDecoderLayerWithKVCache/layers.py

repeats the decoder layers with almost the same inputs (attention_mask, ...); a toy sketch follows.
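This toy stand-in (not the PR's code; shapes and layer internals are placeholders) illustrates the point: every layer is called in sequence with the same auxiliary input, while only hidden_states is threaded through.

import torch
import torch.nn as nn

class ToyDecoderStack(nn.Module):
    def __init__(self, num_layers: int = 4, dim: int = 16):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor):
        for layer in self.layers:
            # The same attention_mask is reused by every layer.
            hidden_states = layer(hidden_states) * attention_mask
        return hidden_states

stack = ToyDecoderStack()
out = stack(torch.randn(1, 8, 16), torch.ones(1, 8, 1))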


glistening force-pushed the attention branch 4 times, most recently from 00596a2 to 4ef9092, on July 23, 2025 02:58
Comment on lines 27 to 58
from tico.utils.record_input import RecordingInput

# past_key_values
# ---------------
# During prefill, "past_key_values" is not None but an empty Cache instance.
# Passing None makes torch.export happy.

# attention_mask, cache_position
# ------------------------------
# For the NPU, ignore the captured values generated from the example prompt.

input_to_remove = ["past_key_values", "attention_mask", "cache_position"]

with torch.no_grad(), RecordingInput(model, input_to_remove=input_to_remove) as rec:

This is the only change needed to capture the inputs.

do_sample=False,
pad_token_id=tokenizer.eos_token_id,
)
captured_input = rec.captured_input

Retrieve captured_input


model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
circle_model = tico.convert(model, captured_input)
glistening commented Jul 23, 2025


Then, pass captured_input to tico.convert.
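Putting the hunks above together, the capture-and-convert flow looks roughly like this. The model id, the prompt, and any generate() arguments beyond those visible in the diff are placeholders/assumptions.

import torch
import tico
from transformers import AutoModelForCausalLM, AutoTokenizer
from tico.utils.record_input import RecordingInput

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Inputs that should not appear in the exported graph.
input_to_remove = ["past_key_values", "attention_mask", "cache_position"]

with torch.no_grad(), RecordingInput(model, input_to_remove=input_to_remove) as rec:
    inputs = tokenizer("Hello", return_tensors="pt")  # placeholder prompt
    model.generate(
        **inputs,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
captured_input = rec.captured_input

# Convert a fresh model instance with the captured example inputs.
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
circle_model = tico.convert(model, captured_input)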


glistening commented Jul 28, 2025

7/28 Offline Code Review

  • The assumption that k_cache is one contiguous tensor formed by concatenating k_cache_0, k_cache_1, ... is not valid for a quantized model when each layer's k_cache has a different qparam (same for v_cache); see the sketch below.
    • I will keep this structure since it is an f-circle for the CPU, not a q-circle for the NPU.
    • If I used a q-circle for the CPU, it would not be a q8 circle but block quantization like ggml_q4.
  • remove_unused_input may have side effects if the ExportedProgram uses the unused placeholder. (What side effects? No one knows, but it is not a required pass; I just wanted to make the graph tidy.) I will not push the remove_unused_input pass.
  • op_circle_attention seems onert-specific. It will be put under serialize/operators/onert as op_attention.
  • It is okay to add record_input.py since it does not break anything for code that does not use it.
  • The new test case which uses record_input.py will use a function unittest instead of a module unittest (which requires get_example_inputs).
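A numerical illustration (made-up values) of the first point: with one shared scale for a contiguous int8 buffer, the layer with the smaller value range loses precision compared to a per-layer scale.

import torch

k_cache_0 = torch.tensor([0.02, -0.05, 0.07])  # layer 0: small value range
k_cache_1 = torch.tensor([1.80, -2.30, 3.10])  # layer 1: large value range

def int8_roundtrip_error(x: torch.Tensor, scale: float) -> float:
    # Symmetric int8 quantize -> dequantize, report max abs error.
    q = torch.clamp(torch.round(x / scale), -128, 127)
    return (q * scale - x).abs().max().item()

per_layer_scale = k_cache_0.abs().max().item() / 127
shared_scale = torch.cat([k_cache_0, k_cache_1]).abs().max().item() / 127

print("layer-0 error with its own scale:", int8_roundtrip_error(k_cache_0, per_layer_scale))
print("layer-0 error with shared scale: ", int8_roundtrip_error(k_cache_0, shared_scale))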


glistening commented Jul 28, 2025

I took a look at ExportedProgram in the PyTorch source. If we are not certain about removing unused inputs at the torch IR level, it would be better to do it in circle2circle in ONE or another circle-modification tool (cc @llFreetimell). Circle is preferred from several perspectives.

glistening force-pushed the attention branch 4 times, most recently from 5a35218 to 2501221, on July 31, 2025 00:51
return hidden_states


@torch.library.register_fake("circle::attention.llama")

Without this, export fails with:

torch._dynamo.exc.Unsupported: Operator does not support running with fake tensors

  Developer debug context: unsupported operator: circle.attention.llama
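A minimal, self-contained sketch of the fake (meta) kernel that avoids this error. The single-tensor schema mirrors the simplified sketch earlier in this thread (repeated here for self-containment); the real op takes the full fused-attention inputs.

import torch

lib = torch.library.Library("circle", "DEF")
lib.define("attention.llama(Tensor hidden_states) -> Tensor")

@torch.library.register_fake("circle::attention.llama")
def attention_llama_fake(hidden_states: torch.Tensor) -> torch.Tensor:
    # No real computation: only shape/dtype propagation on FakeTensors,
    # which is what torch.export / dynamo needs during tracing.
    return torch.empty_like(hidden_states)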


glistening commented Aug 13, 2025

  • Added decode.py for decoding with the whole model.
  • Removed attention_mask from the fused op_attention because:
    • I am thinking of running the decode phase on the CPU, not the NPU, so I assume there is no pad token to the left or right of the actual tokens.
  • No longer assume kv_cache.key_cache is one big chunk. Each layer has its own k and v cache tensors (sketched below).
  • Removed the "CPU" dummy implementation in op_attention.py. @seockho-kim informed me it works without it while working on the RMSNorm op.
  • No Optional for op_attention. (attention_mask is eliminated, and the Cache is mandatory since I decided to focus on {decoding on cpu, prefill on npu}.)
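A hypothetical sketch of the per-layer cache layout (shapes are placeholders, not the PR's values):

import torch

num_layers, batch, num_heads, max_seq, head_dim = 2, 1, 4, 128, 64

# Each decoder layer owns its own key/value cache tensor instead of slicing
# one big concatenated buffer, so each can later carry its own qparams.
k_caches = [torch.zeros(batch, num_heads, max_seq, head_dim) for _ in range(num_layers)]
v_caches = [torch.zeros(batch, num_heads, max_seq, head_dim) for _ in range(num_layers)]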

Sanggyu Lee added 19 commits November 4, 2025 14:09
It introduces the circle_attention op and adds tests which fuse attention
from LlamaDecoderLayers.

TICO-DCO-1.0-Signed-off-by: Sanggyu Lee <sg5.lee@samsung.com>
model.py for LlamaModel
layer.py for LlamaDecoderLayer
The library.impl for "CPU" turned out to be unnecessary.
The causal attention_mask can be calculated in op_attention. However, op_attention is not the proper place, because the cos and sin tables would have to be recalculated in every attention layer even though they are sharable across the decoder layers.
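For context, a sketch (the standard RoPE formulation, not this PR's code) of why the cos/sin tables are sharable: they depend only on position and head_dim, so they can be computed once and reused by every decoder layer.

import torch

def rope_tables(seq_len: int, head_dim: int, base: float = 10000.0):
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    pos = torch.arange(seq_len).float()
    freqs = torch.outer(pos, inv_freq)       # (seq_len, head_dim/2)
    emb = torch.cat((freqs, freqs), dim=-1)  # (seq_len, head_dim)
    return emb.cos(), emb.sin()

cos, sin = rope_tables(seq_len=128, head_dim=64)  # computed once
# ...and then shared by every decoder layer's attention.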
glistening force-pushed the attention branch 2 times, most recently from ae241d2 to 7ed0d2a, on November 4, 2025 05:12

#400

glistening closed this on Nov 26, 2025