
Conversation

@qraniumcitest
Owner

No description provided.

Signed-off-by: qraniumcitest <rmakar@qti.qualcomm.com>
@qraniumcitest
Owner Author

From GHES (comment) by @qgeniecodeassistant[bot]

Code Assistant

Reviewed Commits: cb7da87, efb34ea, 35d8fd8, 118100c, 7e8838f, 25236bb, b2dd328, be7511b, 04f1ad7, c75a637, ed965fd, c788f17, f4ff803, 44fe97b, a5056d7

Updated the code to the correct, current syntax and removed the device_group
parameter from model.compile().

Signed-off-by: Sharvari Medhe smedhe@qti.qualcomm.com

Signed-off-by: Mohit Soni mohisoni@qti.qualcomm.com

📢 Expanded On-Device Sampling Support in QEfficient

Excited to share that On-Device Sampling—previously available only
for LlamaForCausalLM—is now supported across a broader set of
architectures! This enhancement brings faster, more efficient inference
directly to the QAIC device.

✅ Newly Supported Architectures:

  1. FalconForCausalLM
  2. GemmaForCausalLM
  3. GPT2LMHeadModel
  4. GPTJForCausalLM
  5. GraniteForCausalLM
  6. GraniteMoeForCausalLM
  7. LlamaForCausalLM (existing)
  8. MptForCausalLM
  9. Phi3ForCausalLM
  10. Qwen2ForCausalLM

⚠️ Architectures Still Pending Support:

  1. GPTBigCodeForCausalLM
  2. InternVLChatModel
  3. MistralForCausalLM
  4. MixtralForCausalLM
  5. LlamaSwiftKVForCausalLM
  6. Grok1ModelForCausalLM

We’re actively working to extend support to these models. Contributions,
feedback, and testing from the community are always welcome to help
accelerate this effort!
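
For intuition, the snippet below is a host-side NumPy reference for the kind of computation that On-Device Sampling moves onto the QAIC device (temperature scaling plus top-k over the last-token logits). It is a sketch of the concept only, not QEfficient's actual sampler implementation or API; the function name and defaults are hypothetical.

import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50, rng=None):
    """Reference top-k / temperature sampling over last-token logits (shape [vocab_size])."""
    rng = rng or np.random.default_rng()
    scaled = logits / max(temperature, 1e-5)
    top_ids = np.argpartition(scaled, -top_k)[-top_k:]          # indices of the k largest logits
    probs = np.exp(scaled[top_ids] - np.max(scaled[top_ids]))   # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(top_ids, p=probs))

# Running the sampler on-device means only the chosen token id crosses back to the host,
# rather than the full vocabulary-sized logits tensor, which is where the latency win comes from.
next_token = sample_next_token(np.random.randn(32000))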


Signed-off-by: quic-sanising quic_sanising@quicinc.com
Signed-off-by: sanising sanising@qti.qualcomm.com
Signed-off-by: Dhiraj Kumar Sah dhirajku@qti.qualcomm.com
Co-authored-by: sanising sanising@qti.qualcomm.com
Co-authored-by: Dhiraj Kumar Sah dhirajku@qti.qualcomm.com
Co-authored-by: Hem Agnihotri hemagnih@qti.qualcomm.com

Signed-off-by: meetkuma meetkuma@qti.qualcomm.com

Signed-off-by: Mamta Singh mamtsing@qti.qualcomm.com
Signed-off-by: Rishin Raj rishinr@qti.qualcomm.com
Signed-off-by: Asmita Goswami asmigosw@qti.qualcomm.com
Signed-off-by: Mohit Soni mohisoni@qti.qualcomm.com
Signed-off-by: vbaddi quic_vbaddi@quicinc.com
Co-authored-by: Mamta Singh mamtsing@qti.qualcomm.com
Co-authored-by: Asmita Goswami asmigosw@qti.qualcomm.com
Co-authored-by: Rishin Raj rishinr@qti.qualcomm.com
Co-authored-by: Mohit Soni mohisoni@qti.qualcomm.com
Co-authored-by: Vinayak Baddi vbaddi@qti.qualcomm.com

Signed-off-by: Mohit Soni mohisoni@qti.qualcom.com
Co-authored-by: Mohit Soni mohisoni@qti.qualcom.com

Signed-off-by: vbaddi quic_vbaddi@quicinc.com
Signed-off-by: Onkar Chougule ochougul@qti.qualcomm.com
Signed-off-by: Mamta Singh mamtsing@qti.qualcomm.com
Signed-off-by: Mamta Singh 168400541+quic-mamta@users.noreply.github.com
Co-authored-by: Vinayak Baddi quic_vbaddi@quicinc.com
Co-authored-by: Vinayak Baddi vbaddi@qti.qualcomm.com
Co-authored-by: Mamta Singh mamtsing@qti.qualcomm.com
Co-authored-by: Mamta Singh 168400541+quic-mamta@users.noreply.github.com

Signed-off-by: Varun Gupta vargupt@qti.qualcomm.com
Co-authored-by: Abhishek Kumar Singh sabhis@qti.qualcomm.com

Signed-off-by: Tanisha tchawada@qti.qualcomm.com

Signed-off-by: Tanisha tchawada@qti.qualcomm.com

This pull request updates the ONNX opset version from 13 to 17.
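
For reference, here is a minimal sketch of what the opset bump means at export time, calling torch.onnx.export directly on a toy module; QEfficient's export path wraps this, so the module and file names below are purely illustrative. Opset 17 adds, for example, a native LayerNormalization operator, so layer norms no longer need to be decomposed into primitive ops.

import torch

class TinyModel(torch.nn.Module):
    def forward(self, x):
        # With opset 17 this can export as a single LayerNormalization node.
        return torch.nn.functional.layer_norm(x, x.shape[-1:])

torch.onnx.export(
    TinyModel(),
    torch.randn(1, 8, 64),
    "tiny_model.onnx",
    opset_version=17,  # previously 13
    input_names=["x"],
    output_names=["y"],
)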

Testing

Below are the models I have tested:

Causal Models

  • TinyLlama/TinyLlama-1.1B-Chat-v1.0
  • gpt2
  • Salesforce/codegen-350M-mono
  • microsoft/Phi-3-mini-4k-instruct
  • tiiuae/falcon-7b
  • Qwen/Qwen2-0.5B
  • Qwen/Qwen3-0.6B
  • bigcode/starcoder2-3b
  • Qwen/Qwen3-30B-A3B-Instruct-2507
  • Felladrin/Minueza-32M-Base
  • wtang06/mpt-125m-c4
  • hakurei/gpt-j-random-tinier
  • mistralai/Mixtral-8x7B-Instruct-v0.1
  • meta-llama/Llama-3.2-1B
  • unsloth/gemma-2b
  • unsloth/gemma-2-2b
  • TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ
  • TheBloke/Llama-2-7B-GPTQ
  • ibm-granite/granite-20b-code-base
  • neuralmagic/Llama-3.2-3B-Instruct-FP8
  • neuralmagic/Qwen2-0.5B-Instruct-FP8
  • ibm-granite/granite-3.1-2b-instruct
  • ibm-granite/granite-guardian-3.1-2b
  • hpcai-tech/grok-1
  • Snowflake/Llama-3.1-SwiftKV-8B-Instruct
  • allenai/OLMo-2-0425-1B

Embedding Models

  • BAAI/bge-base-en-v1.5
  • BAAI/bge-large-en-v1.5
  • BAAI/bge-small-en-v1.5
  • intfloat/e5-large-v2
  • sentence-transformers/multi-qa-mpnet-base-cos-v1
  • ibm-granite/granite-embedding-30m-english
  • ibm-granite/granite-embedding-125m-english
  • BAAI/bge-reranker-v2-m3
  • ibm-granite/granite-embedding-107m-multilingual
  • ibm-granite/granite-embedding-278m-multilingual

Vision Models

  • llava-hf/llava-1.5-7b-hf
  • OpenGVLab/InternVL2_5-1B
  • meta-llama/Llama-3.2-11B-Vision-Instruct
  • ibm-granite/granite-vision-3.2-2b
  • meta-llama/Llama-4-Scout-17B-16E-Instruct
  • google/gemma-3-4b-it

Audio Models

  • openai/whisper-tiny
  • openai/whisper-base
  • openai/whisper-small
  • openai/whisper-medium
  • openai/whisper-large
  • openai/whisper-large-v3-turbo

Signed-off-by: Abukhoyer Shaik abukhoye@qti.qualcomm.com

Signed-off-by: Abukhoyer Shaik abukhoye@qti.qualcomm.com

The Compute Context Length (CCL) technique optimizes the throughput of large
language models (LLMs) on Qualcomm devices when handling very large context
lengths. Current Ahead-Of-Time (AOT) compilation on Qualcomm devices cannot
predict how many tokens a request will actually need, so attention is always
computed over the full compiled context length, causing significant throughput
drops during both the prefill and decode phases. To address this, we introduce
CCL, an additional ONNX variable that enables dynamic context-length
specialization. By generating tokens with smaller, more manageable context
lengths (CCL buckets), we reduce memory reads and attention computation,
thereby improving throughput.
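
As a rough illustration of the idea (a sketch only, not the library's actual API), the helper below picks the smallest CCL bucket that still covers the largest position generated so far, so attention and KV-cache reads only span that bucket; the bucket values and function name are hypothetical.

import numpy as np

def select_ccl_bucket(position_ids, ccl_buckets):
    """Return the smallest CCL bucket covering the largest position seen so far.

    ccl_buckets is assumed to be sorted ascending, e.g. [512, 1024, 4096] for a
    model compiled with a 4096-token context length.
    """
    max_position = int(np.max(position_ids))
    for bucket in ccl_buckets:
        if max_position < bucket:
            return bucket  # attention only needs to read this many KV entries
    return ccl_buckets[-1]  # fall back to the full compiled context length

# Example: at position 700 only the 1024-token bucket is needed, not the full 4096.
print(select_ccl_bucket(np.array([700]), [512, 1024, 4096]))  # -> 1024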


Signed-off-by: Vahid Janfaza vjanfaza@qti.qualcomm.com

  • 44fe97b: Create Mirror_Fork_PRs_to_GHES.yml

Signed-off-by: qraniumcitest rmakar@qti.qualcomm.com

  • a5056d7: Update test_modeling_qeff.py

Signed-off-by: qraniumcitest rmakar@qti.qualcomm.com

Pull Request Overview

This PR introduces Compute Context Length (CCL) support across the QEfficient codebase, enabling dynamic context length management for LLM inference. The changes span multiple model architectures, generation infrastructure, and workflow automation.

Files Changed Summary

File                                                  Lines Changed   Issues Found   Highest Severity
.github/workflows/Mirror_Fork_PRs_to_GHES.yml         +235            3              High
QEfficient/cloud/infer.py                             +12             0              -
QEfficient/generation/cloud_infer.py                  +4              1              Medium
QEfficient/generation/embedding_handler.py            +367            2              Medium
QEfficient/generation/text_generation_inference.py    +89             2              Medium
QEfficient/generation/vlm_generation.py               +800            1              Low
QEfficient/transformers/models/*/modeling_*.py        ~2000+          0              -
Various model files                                   Multiple        0              -

Key Changes

  1. CCL Infrastructure: Added comp_ctx_lengths_prefill and comp_ctx_lengths_decode parameters throughout the generation pipeline
  2. Vision-Language Models: New VisionHandler and VisionLanguageGeneration classes for VLM support
  3. Model Architecture Updates: CCL support added to 20+ model architectures (Llama, Mistral, Gemma, etc.)
  4. GitHub Workflow: New workflow for mirroring fork PRs to GHES

Critical Issues Identified

  1. [SECURITY] Hardcoded credentials exposure in GitHub workflow (High)
  2. [FUNCTIONALITY] Missing error handling in vision processing (Medium)
  3. [PERFORMANCE] Inefficient session management in vision inference (Medium)
  4. [MAINTAINABILITY] Incomplete state tracking in generation classes (Medium)

[SECURITY] Hardcoded credentials and token exposure in GitHub workflow - High Severity

The GitHub workflow file contains multiple security vulnerabilities:

  1. Token length logging (line 207): Logs the length of sensitive tokens which can aid attackers
  2. Verbose error output (lines 215-217): Prints full API responses that may contain sensitive data
  3. Insufficient access control validation (lines 153-161): Simple string comparison for authorization without rate limiting

Security Risks:

  • Token metadata exposure aids brute force attacks
  • API response leakage may expose internal infrastructure details
  • No protection against authorization bypass attempts

Fixed Code Snippet:

# Remove token length logging
RESP="$(curl -sS -H "Authorization: token ${GHES_PAT}" \
          -H "Accept: application/vnd.github+json" \
          "${API}/pulls?state=open&head=${GHES_OWNER}:${BRANCH}" \
          -w "\n%{http_code}")"
# Don't log: echo "Token length: ${#GHES_PAT}"

HTTP_CODE="$(printf '%s\n' "$RESP" | tail -n1)"
JSON="$(printf '%s\n' "$RESP" | sed '$d')"
echo "HTTP_CODE=${HTTP_CODE}"
if [ "${HTTP_CODE}" != "200" ]; then
  echo "Non-200 response from GHES pulls query"
  # Don't echo full JSON: echo "$JSON"
  exit 78
fi

[FUNCTIONALITY] Missing error handling in vision session activation - Medium Severity

In QEfficient/generation/cloud_infer.py, the is_active flag is set but never used for validation. The code sets self.is_active = True after activation but doesn't check this flag before operations, potentially leading to operations on inactive sessions.

Problem:

  • The is_active flag is introduced but not utilized for state validation
  • No checks prevent operations on deactivated sessions
  • Could lead to runtime errors if session is used after deactivation

Fixed Code Snippet:

def __init__(self, ...):
    # ... existing code ...
    self.is_active = False
    if activate:
        self.activate()
        self.is_active = True

def run(self, inputs):
    if not self.is_active:
        raise RuntimeError("Cannot run inference on inactive session. Call activate() first.")
    # ... existing run logic ...

def deactivate(self):
    if self.is_active:
        # ... deactivation logic ...
        self.is_active = False

[PERFORMANCE] Inefficient session activation/deactivation in vision inference - Medium Severity

In QEfficient/generation/embedding_handler.py, the run_vision_inference method performs session activation/deactivation for every inference call. This creates unnecessary overhead, especially for batch processing or multiple sequential inferences.

Performance Impact:

  • Session activation/deactivation on every call adds latency
  • For batch processing, this overhead multiplies
  • Resource allocation/deallocation overhead

Recommendation:
Implement session pooling or keep sessions active for the duration of a batch operation.

Fixed Code Snippet:

def run_vision_inference(self, vision_inputs: Dict[str, np.ndarray], keep_active: bool = False) -> Dict[str, np.ndarray]:
    """Execute vision model inference with optional session persistence
    
    Args:
        vision_inputs: Preprocessed vision inputs
        keep_active: If True, keep session active after inference for subsequent calls
    """
    if not self._vision_session:
        raise ValueError("Vision session not available")

    lang_was_active = False
    try:
        if self._lang_session and self._lang_session.is_active:
            logger.debug("Deactivating language session before vision inference")
            self._lang_session.deactivate()
            lang_was_active = True

        if not self._vision_session.is_active:
            logger.debug("Activating vision session for inference")
            self._vision_session.activate()

        vision_outputs = self._vision_session.run(vision_inputs)

        if not keep_active:
            logger.debug("Deactivating vision session after inference")
            self._vision_session.deactivate()

        if lang_was_active and self._lang_session:
            logger.debug("Reactivating language session after vision inference")
            self._lang_session.activate()

        return vision_outputs
    except Exception as e:
        # Log and re-raise so the sketch stays syntactically valid; callers may also
        # want to restore session state here before propagating.
        logger.error(f"Vision inference failed: {e}")
        raise

[FUNCTIONALITY] Incomplete CCL initialization in decode phase - Medium Severity

In QEfficient/generation/text_generation_inference.py, the initialize_ccl method is defined but the CCL state is not properly maintained across decode iterations. The method recalculates ccl_id on every call without tracking previous state, potentially causing inconsistent context length management.

Issue:

  • CCL ID calculation starts from ccl_id_initial = 0 on every call
  • No state persistence between decode iterations
  • Could lead to incorrect context length selection during long sequences

Fixed Code Snippet:

def initialize_ccl(self, decode_inputs):
    """Initialize CCL with state tracking"""
    if not hasattr(self, '_ccl_state'):
        self._ccl_state = {
            'list_of_comp_ctx_lengths': [np.zeros(length) for length in self.comp_ctx_lengths_decode],
            'current_ccl_id': 0,
            'max_ccl_id': len(self.comp_ctx_lengths_decode) - 1
        }
    
    max_position_id = np.max(decode_inputs["position_ids"])
    
    # Update CCL ID based on current position
    for i in range(self._ccl_state['current_ccl_id'], len(self.comp_ctx_lengths_decode)):
        if max_position_id < self.comp_ctx_lengths_decode[i]:
            self._ccl_state['current_ccl_id'] = i
            break
    else:
        # Position exceeds every configured bucket; fall back to the largest one.
        self._ccl_state['current_ccl_id'] = self._ccl_state['max_ccl_id']
    
    return self._ccl_state['current_ccl_id'], self._ccl_state['max_ccl_id']

Version 1.3.6


Source: https://github.qualcomm.com/qranium/efficient-transformers/pull/48#issuecomment-1010952

Signed-off-by: qraniumcitest <rmakar@qti.qualcomm.com>
@qraniumcitest
Owner Author

From GHES (comment) by @qgeniecodeassistant[bot]

Code Assistant

Reviewed commit: c285190 "Update compile.py

Signed-off-by: qraniumcitest rmakar@qti.qualcomm.com"

PR Overview

This PR modifies the compilation script for QEfficient cloud compilation. The change appears to be incomplete and introduces a critical syntax error.

Files Changed Summary

File                          Lines Changed   Issues Found   Highest Severity
QEfficient/cloud/compile.py   1               1              Critical

Critical Issues

  1. Syntax Error: Line 9 contains an incomplete import statement that will cause the Python script to fail immediately upon execution. This is a blocking issue that prevents the script from running.

Impact

The current change breaks the entire compilation script and must be fixed before merging. The script will not execute in its current state.

[FUNCTIONALITY] Incomplete import statement causes syntax error - Critical

Line 9 contains an incomplete import statement with no module name specified. This is a critical syntax error that will cause Python to raise a SyntaxError immediately when attempting to run or import this module.

The original code had a blank line at line 9, which has been replaced with a bare import statement that names no module. This appears to be an incomplete edit or an accidental change.

Impact:

  • The script will fail to execute
  • Any code that imports this module will fail
  • This is a blocking issue for the entire compilation functionality

Fixed Code Snippet:

import argparse

import QEfficient

If a specific module was intended to be imported, it should be added. Otherwise, the blank line should be restored to maintain the original functionality.


Version 1.3.6


Source: https://github.qualcomm.com/qranium/efficient-transformers/pull/48#issuecomment-1011030

@qraniumcitest qraniumcitest reopened this Nov 19, 2025
@qraniumcitest qraniumcitest deleted the Qgenie-Prompt-Testing branch December 3, 2025 07:13