
Conversation

@qraniumcitest
Owner

No description provided.

Signed-off-by: qraniumcitest <rmakar@qti.qualcomm.com>
@qraniumcitest
Owner Author

From GHES (comment) by @qgeniecodeassistant[bot]

Code Assistant

Reviewed Commits: cb7da87, efb34ea, 35d8fd8, 118100c, 7e8838f, 25236bb, b2dd328, be7511b, 04f1ad7, c75a637, ed965fd, c788f17, f4ff803, 44fe97b, a5056d7

Updated the code to the correct, current syntax and removed the device_group
parameter from model.compile().

Signed-off-by: Sharvari Medhe smedhe@qti.qualcomm.com

Signed-off-by: Mohit Soni mohisoni@qti.qualcomm.com

📢 Expanded On-Device Sampling Support in QEfficient

Excited to share that On-Device Sampling—previously available only
for LlamaForCausalLM—is now supported across a broader set of
architectures! This enhancement brings faster, more efficient inference
directly to the QAIC device.

✅ Newly Supported Architectures:

  1. FalconForCausalLM
  2. GemmaForCausalLM
  3. GPT2LMHeadModel
  4. GPTJForCausalLM
  5. GraniteForCausalLM
  6. GraniteMoeForCausalLM
  7. LlamaForCausalLM (existing)
  8. MptForCausalLM
  9. Phi3ForCausalLM
  10. Qwen2ForCausalLM

⚠️ Architectures Still Pending Support:

  1. GPTBigCodeForCausalLM
  2. InternVLChatModel
  3. MistralForCausalLM
  4. MixtralForCausalLM
  5. LlamaSwiftKVForCausalLM
  6. Grok1ModelForCausalLM

We’re actively working to extend support to these models. Contributions,
feedback, and testing from the community are always welcome to help
accelerate this effort!
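
For intuition, the snippet below is a host-side NumPy reference for the kind of computation that On-Device Sampling moves onto the QAIC device (temperature scaling plus top-k over the last-token logits). It is a sketch of the concept only, not QEfficient's actual sampler implementation or API; the function name and defaults are hypothetical.

import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50, rng=None):
    """Reference top-k / temperature sampling over last-token logits (shape [vocab_size])."""
    rng = rng or np.random.default_rng()
    scaled = logits / max(temperature, 1e-5)
    top_ids = np.argpartition(scaled, -top_k)[-top_k:]          # indices of the k largest logits
    probs = np.exp(scaled[top_ids] - np.max(scaled[top_ids]))   # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(top_ids, p=probs))

# Running the sampler on-device means only the chosen token id crosses back to the host,
# rather than the full vocabulary-sized logits tensor, which is where the latency win comes from.
next_token = sample_next_token(np.random.randn(32000))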


Signed-off-by: quic-sanising quic_sanising@quicinc.com
Signed-off-by: sanising sanising@qti.qualcomm.com
Signed-off-by: Dhiraj Kumar Sah dhirajku@qti.qualcomm.com
Co-authored-by: sanising sanising@qti.qualcomm.com
Co-authored-by: Dhiraj Kumar Sah dhirajku@qti.qualcomm.com
Co-authored-by: Hem Agnihotri hemagnih@qti.qualcomm.com

Signed-off-by: meetkuma meetkuma@qti.qualcomm.com

Signed-off-by: Mamta Singh mamtsing@qti.qualcomm.com
Signed-off-by: Rishin Raj rishinr@qti.qualcomm.com
Signed-off-by: Asmita Goswami asmigosw@qti.qualcomm.com
Signed-off-by: Mohit Soni mohisoni@qti.qualcomm.com
Signed-off-by: vbaddi quic_vbaddi@quicinc.com
Co-authored-by: Mamta Singh mamtsing@qti.qualcomm.com
Co-authored-by: Asmita Goswami asmigosw@qti.qualcomm.com
Co-authored-by: Rishin Raj rishinr@qti.qualcomm.com
Co-authored-by: Mohit Soni mohisoni@qti.qualcomm.com
Co-authored-by: Vinayak Baddi vbaddi@qti.qualcomm.com

Signed-off-by: Mohit Soni mohisoni@qti.qualcom.com
Co-authored-by: Mohit Soni mohisoni@qti.qualcom.com

Signed-off-by: vbaddi quic_vbaddi@quicinc.com
Signed-off-by: Onkar Chougule ochougul@qti.qualcomm.com
Signed-off-by: Mamta Singh mamtsing@qti.qualcomm.com
Signed-off-by: Mamta Singh 168400541+quic-mamta@users.noreply.github.com
Co-authored-by: Vinayak Baddi quic_vbaddi@quicinc.com
Co-authored-by: Vinayak Baddi vbaddi@qti.qualcomm.com
Co-authored-by: Mamta Singh mamtsing@qti.qualcomm.com
Co-authored-by: Mamta Singh 168400541+quic-mamta@users.noreply.github.com

Signed-off-by: Varun Gupta vargupt@qti.qualcomm.com
Co-authored-by: Abhishek Kumar Singh sabhis@qti.qualcomm.com

Signed-off-by: Tanisha tchawada@qti.qualcomm.com

Signed-off-by: Tanisha tchawada@qti.qualcomm.com

This pull request updates the ONNX opset version from 13 to 17.
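
For reference, here is a minimal sketch of what the opset bump means at export time, calling torch.onnx.export directly on a toy module; QEfficient's export path wraps this, so the module and file names below are purely illustrative. Opset 17 adds, for example, a native LayerNormalization operator, so layer norms no longer need to be decomposed into primitive ops.

import torch

class TinyModel(torch.nn.Module):
    def forward(self, x):
        # With opset 17 this can export as a single LayerNormalization node.
        return torch.nn.functional.layer_norm(x, x.shape[-1:])

torch.onnx.export(
    TinyModel(),
    torch.randn(1, 8, 64),
    "tiny_model.onnx",
    opset_version=17,  # previously 13
    input_names=["x"],
    output_names=["y"],
)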

Testing

Below are the models I have tested:

Causal Models

  • TinyLlama/TinyLlama-1.1B-Chat-v1.0
  • gpt2
  • Salesforce/codegen-350M-mono
  • microsoft/Phi-3-mini-4k-instruct
  • tiiuae/falcon-7b
  • Qwen/Qwen2-0.5B
  • Qwen/Qwen3-0.6B
  • bigcode/starcoder2-3b
  • Qwen/Qwen3-30B-A3B-Instruct-2507
  • Felladrin/Minueza-32M-Base
  • wtang06/mpt-125m-c4
  • hakurei/gpt-j-random-tinier
  • mistralai/Mixtral-8x7B-Instruct-v0.1
  • meta-llama/Llama-3.2-1B
  • unsloth/gemma-2b
  • unsloth/gemma-2-2b
  • TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ
  • TheBloke/Llama-2-7B-GPTQ
  • ibm-granite/granite-20b-code-base
  • neuralmagic/Llama-3.2-3B-Instruct-FP8
  • neuralmagic/Qwen2-0.5B-Instruct-FP8
  • ibm-granite/granite-3.1-2b-instruct
  • ibm-granite/granite-guardian-3.1-2b
  • hpcai-tech/grok-1
  • Snowflake/Llama-3.1-SwiftKV-8B-Instruct
  • allenai/OLMo-2-0425-1B

Embedding Models

  • BAAI/bge-base-en-v1.5
  • BAAI/bge-large-en-v1.5
  • BAAI/bge-small-en-v1.5
  • intfloat/e5-large-v2
  • sentence-transformers/multi-qa-mpnet-base-cos-v1
  • ibm-granite/granite-embedding-30m-english
  • ibm-granite/granite-embedding-125m-english
  • BAAI/bge-reranker-v2-m3
  • ibm-granite/granite-embedding-107m-multilingual
  • ibm-granite/granite-embedding-278m-multilingual

Vision Models

  • llava-hf/llava-1.5-7b-hf
  • OpenGVLab/InternVL2_5-1B
  • meta-llama/Llama-3.2-11B-Vision-Instruct
  • ibm-granite/granite-vision-3.2-2b
  • meta-llama/Llama-4-Scout-17B-16E-Instruct
  • google/gemma-3-4b-it

Audio Models

  • openai/whisper-tiny
  • openai/whisper-base
  • openai/whisper-small
  • openai/whisper-medium
  • openai/whisper-large
  • openai/whisper-large-v3-turbo

Signed-off-by: Abukhoyer Shaik abukhoye@qti.qualcomm.com

Signed-off-by: Abukhoyer Shaik abukhoye@qti.qualcomm.com

The Compute Context Length (CCL) technique optimizes the throughput of large
language models (LLMs) on Qualcomm devices when handling very large context
lengths. Current Ahead-Of-Time (AOT) compilation on Qualcomm devices cannot
predict how many tokens a request will actually need, so attention is always
computed over the full compiled context length, causing significant throughput
drops during both the prefill and decode phases. To address this, we introduce
CCL, an additional ONNX variable that enables dynamic context-length
specialization. By generating tokens with smaller, more manageable context
lengths (CCL buckets), we reduce memory reads and attention computation,
thereby improving throughput.
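
As a rough illustration of the idea (a sketch only, not the library's actual API), the helper below picks the smallest CCL bucket that still covers the largest position generated so far, so attention and KV-cache reads only span that bucket; the bucket values and function name are hypothetical.

import numpy as np

def select_ccl_bucket(position_ids, ccl_buckets):
    """Return the smallest CCL bucket covering the largest position seen so far.

    ccl_buckets is assumed to be sorted ascending, e.g. [512, 1024, 4096] for a
    model compiled with a 4096-token context length.
    """
    max_position = int(np.max(position_ids))
    for bucket in ccl_buckets:
        if max_position < bucket:
            return bucket  # attention only needs to read this many KV entries
    return ccl_buckets[-1]  # fall back to the full compiled context length

# Example: at position 700 only the 1024-token bucket is needed, not the full 4096.
print(select_ccl_bucket(np.array([700]), [512, 1024, 4096]))  # -> 1024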


Signed-off-by: Vahid Janfaza vjanfaza@qti.qualcomm.com

  • 44fe97b: Create Mirror_Fork_PRs_to_GHES.yml

Signed-off-by: qraniumcitest rmakar@qti.qualcomm.com

  • a5056d7: Update test_modeling_qeff.py

Signed-off-by: qraniumcitest rmakar@qti.qualcomm.com

Pull Request Overview

This PR introduces Compute Context Length (CCL) support across the QEfficient codebase, enabling dynamic context length management for LLM inference. The changes span multiple model architectures, generation infrastructure, and workflow automation.

Files Changed Summary

File                                                  Lines Changed   Issues Found   Highest Severity
.github/workflows/Mirror_Fork_PRs_to_GHES.yml         +235            3              High
QEfficient/cloud/infer.py                             +12             0              -
QEfficient/generation/cloud_infer.py                  +4              1              Medium
QEfficient/generation/embedding_handler.py            +367            2              Medium
QEfficient/generation/text_generation_inference.py    +89             2              Medium
QEfficient/generation/vlm_generation.py               +800            1              Low
QEfficient/transformers/models/*/modeling_*.py        ~2000+          0              -
Various model files                                   Multiple        0              -

Key Changes

  1. CCL Infrastructure: Added comp_ctx_lengths_prefill and comp_ctx_lengths_decode parameters throughout the generation pipeline
  2. Vision-Language Models: New VisionHandler and VisionLanguageGeneration classes for VLM support
  3. Model Architecture Updates: CCL support added to 20+ model architectures (Llama, Mistral, Gemma, etc.)
  4. GitHub Workflow: New workflow for mirroring fork PRs to GHES

Critical Issues Identified

  1. [SECURITY] Hardcoded credentials exposure in GitHub workflow (High)
  2. [FUNCTIONALITY] Missing error handling in vision processing (Medium)
  3. [PERFORMANCE] Inefficient session management in vision inference (Medium)
  4. [MAINTAINABILITY] Incomplete state tracking in generation classes (Medium)

[SECURITY] Hardcoded credentials and token exposure in GitHub workflow - High Severity

The GitHub workflow file contains multiple security vulnerabilities:

  1. Token length logging (line 207): Logs the length of sensitive tokens which can aid attackers
  2. Verbose error output (lines 215-217): Prints full API responses that may contain sensitive data
  3. Insufficient access control validation (lines 153-161): Simple string comparison for authorization without rate limiting

Security Risks:

  • Token metadata exposure aids brute force attacks
  • API response leakage may expose internal infrastructure details
  • No protection against authorization bypass attempts

Fixed Code Snippet:

# Remove token length logging
RESP="$(curl -sS -H "Authorization: token ${GHES_PAT}" \
          -H "Accept: application/vnd.github+json" \
          "${API}/pulls?state=open&head=${GHES_OWNER}:${BRANCH}" \
          -w "\n%{http_code}")"
# Don't log: echo "Token length: ${#GHES_PAT}"

HTTP_CODE="$(printf '%s\n' "$RESP" | tail -n1)"
JSON="$(printf '%s\n' "$RESP" | sed '$d')"
echo "HTTP_CODE=${HTTP_CODE}"
if [ "${HTTP_CODE}" != "200" ]; then
  echo "Non-200 response from GHES pulls query"
  # Don't echo full JSON: echo "$JSON"
  exit 78
fi

[FUNCTIONALITY] Missing error handling in vision session activation - Medium Severity

In QEfficient/generation/cloud_infer.py, the is_active flag is set but never used for validation. The code sets self.is_active = True after activation but doesn't check this flag before operations, potentially leading to operations on inactive sessions.

Problem:

  • The is_active flag is introduced but not utilized for state validation
  • No checks prevent operations on deactivated sessions
  • Could lead to runtime errors if session is used after deactivation

Fixed Code Snippet:

def __init__(self, ...):
    # ... existing code ...
    self.is_active = False
    if activate:
        self.activate()
        self.is_active = True

def run(self, inputs):
    if not self.is_active:
        raise RuntimeError("Cannot run inference on inactive session. Call activate() first.")
    # ... existing run logic ...

def deactivate(self):
    if self.is_active:
        # ... deactivation logic ...
        self.is_active = False

[PERFORMANCE] Inefficient session activation/deactivation in vision inference - Medium Severity

In QEfficient/generation/embedding_handler.py, the run_vision_inference method performs session activation/deactivation for every inference call. This creates unnecessary overhead, especially for batch processing or multiple sequential inferences.

Performance Impact:

  • Session activation/deactivation on every call adds latency
  • For batch processing, this overhead multiplies
  • Resource allocation/deallocation overhead

Recommendation:
Implement session pooling or keep sessions active for the duration of a batch operation.

Fixed Code Snippet:

def run_vision_inference(self, vision_inputs: Dict[str, np.ndarray], keep_active: bool = False) -> Dict[str, np.ndarray]:
    """Execute vision model inference with optional session persistence
    
    Args:
        vision_inputs: Preprocessed vision inputs
        keep_active: If True, keep session active after inference for subsequent calls
    """
    if not self._vision_session:
        raise ValueError("Vision session not available")

    lang_was_active = False
    try:
        if self._lang_session and self._lang_session.is_active:
            logger.debug("Deactivating language session before vision inference")
            self._lang_session.deactivate()
            lang_was_active = True

        if not self._vision_session.is_active:
            logger.debug("Activating vision session for inference")
            self._vision_session.activate()

        vision_outputs = self._vision_session.run(vision_inputs)

        if not keep_active:
            logger.debug("Deactivating vision session after inference")
            self._vision_session.deactivate()

        if lang_was_active and self._lang_session:
            logger.debug("Reactivating language session after vision inference")
            self._lang_session.activate()

        return vision_outputs
    except Exception as e:
        # Log and re-raise so the sketch stays syntactically valid; callers may also
        # want to restore session state here before propagating.
        logger.error(f"Vision inference failed: {e}")
        raise

[FUNCTIONALITY] Incomplete CCL initialization in decode phase - Medium Severity

In QEfficient/generation/text_generation_inference.py, the initialize_ccl method is defined but the CCL state is not properly maintained across decode iterations. The method recalculates ccl_id on every call without tracking previous state, potentially causing inconsistent context length management.

Issue:

  • CCL ID calculation starts from ccl_id_initial = 0 on every call
  • No state persistence between decode iterations
  • Could lead to incorrect context length selection during long sequences

Fixed Code Snippet:

def initialize_ccl(self, decode_inputs):
    """Initialize CCL with state tracking"""
    if not hasattr(self, '_ccl_state'):
        self._ccl_state = {
            'list_of_comp_ctx_lengths': [np.zeros(length) for length in self.comp_ctx_lengths_decode],
            'current_ccl_id': 0,
            'max_ccl_id': len(self.comp_ctx_lengths_decode) - 1
        }
    
    max_position_id = np.max(decode_inputs["position_ids"])
    
    # Update CCL ID based on current position
    for i in range(self._ccl_state['current_ccl_id'], len(self.comp_ctx_lengths_decode)):
        if max_position_id < self.comp_ctx_lengths_decode[i]:
            self._ccl_state['current_ccl_id'] = i
            break
    else:
        # Position exceeds every configured bucket; fall back to the largest one.
        self._ccl_state['current_ccl_id'] = self._ccl_state['max_ccl_id']
    
    return self._ccl_state['current_ccl_id'], self._ccl_state['max_ccl_id']

Version 1.3.6


Source: https://github.qualcomm.com/qranium/efficient-transformers/pull/48#issuecomment-1010952

Signed-off-by: qraniumcitest <rmakar@qti.qualcomm.com>
@qraniumcitest
Owner Author

From GHES (comment) by @qgeniecodeassistant[bot]

Code Assistant

Reviewed commit: c285190 "Update compile.py

Signed-off-by: qraniumcitest rmakar@qti.qualcomm.com"

PR Overview

This PR modifies the compilation script for QEfficient cloud compilation. The change appears to be incomplete and introduces a critical syntax error.

Files Changed Summary

File                          Lines Changed   Issues Found   Highest Severity
QEfficient/cloud/compile.py   1               1              Critical

Critical Issues

  1. Syntax Error: Line 9 contains an incomplete import statement that will cause the Python script to fail immediately upon execution. This is a blocking issue that prevents the script from running.

Impact

The current change breaks the entire compilation script and must be fixed before merging. The script will not execute in its current state.

[FUNCTIONALITY] Incomplete import statement causes syntax error - Critical

Line 9 contains an incomplete import statement with no module name specified. This is a critical syntax error that will cause Python to raise a SyntaxError immediately when attempting to run or import this module.

The original code had a blank line at line 9, which has been replaced with a bare import statement that names no module. This appears to be an incomplete edit or an accidental change.

Impact:

  • The script will fail to execute
  • Any code that imports this module will fail
  • This is a blocking issue for the entire compilation functionality

Fixed Code Snippet:

import argparse

import QEfficient

If a specific module was intended to be imported, it should be added. Otherwise, the blank line should be restored to maintain the original functionality.


Version 1.3.6


Source: https://github.qualcomm.com/qranium/efficient-transformers/pull/48#issuecomment-1011030

@qraniumcitest qraniumcitest reopened this Nov 19, 2025
@qraniumcitest qraniumcitest deleted the Qgenie-Prompt-Testing branch December 3, 2025 07:13