adding Context Length Specialization (CCL) #466
base: main
Conversation
Signed-off-by: vjanfaza <vjanfaza@apex-scl01-giga-linux.qualcomm.com>
How much time do the tests take?
We can choose to test only one model per KV cache type, i.e., chunked, hybrid, sliding window, etc.:
chunked -> global + local -> llama4
hybrid -> sliding window + global -> gemma3
sliding window -> mistral
For the above categories, we need to handle CCL differently. For local or sliding-window layers, the complete CCL probably won't apply once the context goes beyond the sliding-window length, while the full CCL applies for global layers. This support needs to be added.
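A minimal sketch of the per-layer handling described above, using a hypothetical helper (names and layer-type labels are illustrative, not from the PR): global layers use the requested CCL as-is, while local/sliding-window layers cap it at the sliding-window length.

def effective_ccl(ccl: int, layer_type: str, sliding_window: int, ctx_len: int) -> int:
    # Hypothetical helper: local/sliding-window layers never attend beyond
    # the window, so the compute context length can be capped there.
    if layer_type in ("sliding_window", "local", "chunked_local"):
        return min(ccl, sliding_window)
    # Global layers attend over the full requested compute context length.
    return min(ccl, ctx_len)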
@@ -102,6 +102,7 @@ def main(
    full_batch_size: Optional[int] = None,
    prompt_len: int = 32,
    ctx_len: int = 128,
    comp_ctx_lengths: Optional[List[int]] = None,
Please add this to the docstring of the function as well.
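For example, the docstring entry could read roughly like this (wording and formatting are a suggestion, not taken from the PR):

    :comp_ctx_lengths (Optional[List[int]]): List of compute context lengths (CCL) to specialize the compiled model for; if None, only the full ctx_len is used. Defaults to None.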
@@ -1489,6 +1491,8 @@ def from_pretrained(

    kv_offload = kwargs.pop("kv_offload", None)
I think comp_ctx_lengths should be handled as an explicit parameter in from_pretrained rather than inside kwargs. Then there will be no need to pop that variable from kwargs, and we can add a proper docstring for it here as well.
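A sketch of what that suggestion could look like (class name, parameter order, and docstring wording are assumptions; other arguments elided):

from typing import List, Optional

class QEFFAutoModelForCausalLM:  # sketch only; the real class already exists in QEfficient
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, kv_offload: Optional[bool] = None, comp_ctx_lengths: Optional[List[int]] = None, *args, **kwargs):
        """
        comp_ctx_lengths (Optional[List[int]]): Compute context lengths (CCL) to
            specialize the model for; None means only the full ctx_len is used.
        """
        ...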
self.comp_ctx_lengths = kwargs.pop("comp_ctx_lengths", None)
Why? I don't think there is any need for this.
We can use kwargs.get instead of pop. We are planning to use these kwargs for creating the model hash.
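i.e., roughly (a sketch of this suggestion):

# get() leaves the key in kwargs so it still feeds into the model-hash computation.
self.comp_ctx_lengths = kwargs.get("comp_ctx_lengths", None)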
for i in range(1, len(self.comp_ctx_lengths)):
    decode_spec = self.build_decode_specialization(
        prefill_seq_len=prefill_seq_len,
        ctx_len=ctx_len,
There is no need for the if/else condition; please handle this for loop inside build_decode_specialization.
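A possible shape for that refactor (a sketch only; the real build_decode_specialization signature and specialization keys may differ):

def build_decode_specialization(self, prefill_seq_len: int, ctx_len: int):
    # Sketch: iterate over the configured CCL values internally and return
    # one decode specialization per compute context length.
    specializations = []
    for ccl in (self.comp_ctx_lengths or [ctx_len]):
        specializations.append(
            {
                "batch_size": 1,
                "seq_len": 1,  # decode emits one token at a time
                "ctx_len": ctx_len,
                "comp_ctx_lengths": ccl,
            }
        )
    return specializations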
@@ -29,6 +30,16 @@
from QEfficient.transformers.modeling_attn_mask_utils import _create_causal_mask


@dataclass
class QEffBaseModelOutputWithPast(BaseModelOutputWithPast):
Since this dataclass is common across all the modeling files, it is better to keep it in modeling_utils.py.
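For example (a sketch of where it could live; the extra field is hypothetical and only illustrates why a shared definition helps):

# QEfficient/transformers/modeling_utils.py
from dataclasses import dataclass
from typing import Optional

import torch
from transformers.modeling_outputs import BaseModelOutputWithPast


@dataclass
class QEffBaseModelOutputWithPast(BaseModelOutputWithPast):
    # Hypothetical extra field; the fields actually added in this PR may differ.
    comp_ctx_lengths: Optional[torch.Tensor] = None

Importing it from modeling_utils.py in each modeling file avoids duplicating the class definition.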
The Context Length Specialization (CCL) technique optimizes the throughput of large language models (LLMs) on Qualcomm devices when handling very large context lengths. The current ahead-of-time (AOT) compilation on Qualcomm devices cannot predict the number of tokens needed, which leads to significant throughput drops during the prefill and decode phases because the system performs attention calculations over the full, large context length. To address this, we introduce Compute Context Length (CCL), an additional ONNX variable that allows dynamic context-length specialization. By generating tokens using smaller, more manageable compute context lengths, we reduce memory reads and attention calculations, thereby improving throughput.
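The core idea can be illustrated with a small standalone sketch (the bucket values and helper name are hypothetical): during generation, the runtime picks the smallest compiled CCL bucket that covers the tokens produced so far, so attention and KV-cache reads only span that bucket instead of the full ctx_len.

# Hypothetical CCL buckets compiled as specializations, plus the full context length.
comp_ctx_lengths = [512, 1024, 2048, 4096]
ctx_len = 4096

def pick_ccl(cache_position: int) -> int:
    # Return the smallest compute context length that can hold the current cache.
    for ccl in sorted(comp_ctx_lengths):
        if cache_position <= ccl:
            return ccl
    return ctx_len

assert pick_ccl(100) == 512     # short sequences pay for a small attention window
assert pick_ccl(1500) == 2048
assert pick_ccl(4000) == 4096   # long sequences fall back to the full context length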