
Conversation

@xin3he
Contributor

@xin3he xin3he commented Jan 6, 2026

PR Type

Enhancement, Bug fix


Description

  • Added QuantizedHpuBlockSoftmaxConstMax class for handling block softmax operations

  • Updated scale calculation for FSDPA to prevent division by zero

  • Changed dynamic quant check to use op string instead of type

  • Fixed scale calculation flow for CGUID in weight scaling


Diagram Walkthrough

flowchart LR
  A["Add QuantizedHpuBlockSoftmaxConstMax"] -- "Handle block softmax" --> B["Update FSDPA scale calculation"]
  B -- "Prevent division by zero" --> C["Change dynamic quant check"]
  C -- "Use op string" --> D["Fix CGUID scale calculation"]

File Walkthrough

Relevant files

Enhancement (1 file)
  hpu_quantized_func_wrapper.py: Add Block Softmax and Update Scale Calculation (+23/-11)

Bug fix (1 file)
  quantize.py: Update Dynamic Quant Check and Op String Usage (+2/-2)

Additional files (16 files)
  common.py +5/-0
  external_func_impl.py +40/-0
  fp_utils.py +3/-3
  patching_common.py +7/-1
  quantized_func_wrapper.py +1/-0
  xpu_quantized_func_wrapper.py +4/-4
  scale.py +1/-1
  scale_handler.py +4/-0
  ops_quantizer.py +27/-21
  round_scales_function.py +2/-2
  scales_method.py +52/-45
  utils.py +1/-1
  vllm_functions.py +0/-32
  helper_modules.py +197/-50
  quant_config.py +4/-4
  test_xpu_basic.py +79/-8

ulivne and others added 16 commits January 5, 2026 09:54
Enable the CGUID scale-calculation path for static quantization when computing the weight scale.
A flow that does not go through CGUID needs to divide maxabs by the fullscale and backoff factor.
In PTS there is a cast to hp_dtype.
Note: in test_qdq there was a type mismatch that required explicitly casting to hp_dtype in the CGUID call.

Co-authored-by: linoy buchnik <lbuchnik@habana.ai>
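
For reference, a minimal sketch of the non-CGUID flow described above (the function name and the hp_dtype default are illustrative, not the actual implementation; the CGUID path instead goes through torch.ops.hpu.calculate_scale_for_cast, as shown in the reviewer guide below):

import torch

def weight_scale_without_cguid(weight, fullscale, backoff, hp_dtype=torch.bfloat16):
    """Illustrative non-CGUID weight-scale calculation."""
    # Divide the tensor's max absolute value by fullscale and the backoff factor...
    xmaxabs = weight.abs().max()
    scale = xmaxabs / (fullscale * backoff)
    # ...and cast to the high-precision dtype, as PTS does.
    return scale.to(hp_dtype)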
In the input, we use zero tokens for padding. After the linear layer, we set the corresponding (padding) positions to -inf, so that the softmax outputs values close to epsilon.

When using the FSDPA optimization, to improve performance we avoid copying the -inf values into the softmax and instead set those positions directly to zero. As a result, the softmax output becomes exactly zero (as opposed to a small epsilon value without the FSDPA optimization).

When computing the dynamic scale for the out_proj, this leads to a division-by-zero issue.

The fix is to use max(epsilon, scale) during the scale calculation.
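
A minimal sketch of that clamp, assuming a hypothetical dynamic-scale helper (the names and the epsilon value are illustrative, not the actual out_proj code path):

import torch

EPS = torch.finfo(torch.bfloat16).tiny  # illustrative epsilon; the real value may differ

def dynamic_scale_from_maxabs(x, fullscale, backoff=1.0):
    # With the FSDPA optimization, fully padded rows come out of the softmax as
    # exactly zero, so x.abs().max() can be 0 and the naive scale would be 0 too,
    # causing a division by zero once values are later divided by the scale.
    scale = x.abs().max() / (fullscale * backoff)
    return torch.clamp(scale, min=EPS)  # max(epsilon, scale)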

This fix aligns non-CGUID code to act the same as the CGUID flow
… also in ops_quantizer (#248)

This prevents implicitly applying dynamic quantization to ops that do not support it.
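
As an illustration of checking the op string rather than the type (all names here are hypothetical; the actual checks live in quantize.py and ops_quantizer.py):

# Ops explicitly validated for dynamic quantization (hypothetical list).
DYNAMIC_QUANT_SUPPORTED_OPS = {"linear", "matmul"}

def should_quantize_dynamically(op_name: str) -> bool:
    # An exact string match keeps the check explicit: a type/isinstance check
    # would also match subclasses and silently pull in ops that were never
    # validated for dynamic quantization.
    return op_name in DYNAMIC_QUANT_SUPPORTED_OPS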
[SW-237232] add support in SGLang (#291)
[SW-237037] add support for BLOCK_SOFTMAX_CONST_MAX (#292)
Co-authored-by: Kamil Kaczor <kkaczor@habana.ai>
* [SW-239679] temporarily disable deprecated import

* Update correct import path

Co-authored-by: Xin He <xin3.he@intel.com>

* Also disable auto round tests

---------

Co-authored-by: Xin He <xin3.he@intel.com>
* [PERFC-270] add xpu qdq tests using inc

* [PERFC-270] - add xfail markers to currently unsupported tests
* update 1-element tensor as scalar

Change-Id: I0920bf38ab6de1d8940292773062be9d1de21858
Signed-off-by: Yi Liu <yiliu4@habana.ai>

* clean code

Change-Id: I744ab33f7ce4711d0968589f13f672d09f22bca6
Signed-off-by: Yi Liu <yiliu4@habana.ai>

* fix

Change-Id: Ic6ee7f38d4b911247c3727fdcc739030f65ace49
Signed-off-by: Yi Liu <yiliu4@habana.ai>

* refine

Change-Id: I0492c7d5ddb3b257bf7c550e31bc8a38c7230d08
Signed-off-by: Yi Liu <yiliu4@habana.ai>

* update doc

Change-Id: I8e8228a1d2948807f2574c418584e911eba8d949
Signed-off-by: Yi Liu <yiliu4@habana.ai>

---------

Signed-off-by: Yi Liu <yiliu4@habana.ai>
Co-authored-by: Yi Liu <yiliu4@habana.ai>
* pass dtype to scalar

---------

Signed-off-by: Yi Liu <yiliu4@habana.ai>
Co-authored-by: Yi Liu <yiliu4@habana.ai>
Change-Id: I47f5259a247bbce0c6290d1d1d1bb47071bd3256

Signed-off-by: Yi Liu <yiliu4@habana.ai>
Co-authored-by: Yi Liu <yiliu4@habana.ai>
* add VllmMixtureOfExpertsOpFP8PerChannel and refine check

---------

Signed-off-by: yiliu30 <yi4.liu@intel.com>
Signed-off-by: Yi Liu <yiliu4@habana.ai>
Co-authored-by: Yi Liu <yiliu4@habana.ai>
Support deletion of MoE high-precision weights
This solves OOM issues in large models with MoE
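
A rough sketch of the idea, assuming a hypothetical layout in which each expert keeps its original weights under an orig_weight attribute (neither name comes from this PR):

import gc
import torch

def free_high_precision_moe_weights(moe_module: torch.nn.Module) -> None:
    # Once the quantized expert weights are materialized, the original
    # high-precision copies are no longer needed; dropping the references
    # avoids holding two full sets of MoE weights in device memory.
    for expert in getattr(moe_module, "experts", []):    # hypothetical attribute
        if getattr(expert, "orig_weight", None) is not None:
            expert.orig_weight = None                    # hypothetical attribute
    gc.collect()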
@PRAgent4INC
Collaborator

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Naming Consistency

The new functions calculate_scale_maxabs_with_cguid and calculate_scale_rounding_with_cguid have names that include cguid, but the old functions calculate_scale_maxabs and calculate_scale_rounding do not. Ensure that the naming is consistent or that the addition of cguid is justified and documented.

def calculate_scale_maxabs_with_cguid(x, maxMode, **kwargs):
    return torch.ops.hpu.calculate_scale_for_cast(
        x, maxMode.value, ScaleCalculationRoundingMode.NO_SCALE_ROUNDING.value, **kwargs
    )


def calculate_scale_rounding_with_cguid(x, scaleMode, **kwargs):
    return torch.ops.hpu.calculate_scale_for_cast(
        x, ScaleCalculationMaxMode.NO_MAX_CALCULATION.value, scaleMode.value, **kwargs
    )
Function Renaming Impact

The renaming of calc_maxabs_scale to calc_scale_from_maxabs could affect other parts of the codebase that rely on the original function name. Verify that all references to the old function name have been updated accordingly.

def calc_scale_from_maxabs(xmaxabs, fullscale, backoff=1):
    scale = xmaxabs / (fullscale * backoff)
    return scale
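
For example (numbers chosen purely for illustration; the full scale is device-specific in practice), a call site that previously used calc_maxabs_scale would now read:

import torch

xmaxabs = torch.tensor(3.0)   # max absolute value observed in the tensor
fullscale = 448.0             # illustrative FP8 E4M3 full scale
backoff = 0.5                 # illustrative backoff factor

scale = calc_scale_from_maxabs(xmaxabs, fullscale, backoff)  # 3.0 / (448.0 * 0.5) ≈ 0.0134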

@PRAgent4INC
Collaborator

Failed to generate code suggestions for PR

@xin3he
Contributor Author

xin3he commented Jan 6, 2026

Hi @linoybu, feel free to drop any review comments! 😀

@xin3he xin3he requested a review from XuehaoSun January 6, 2026 02:41
@xin3he xin3he added this to the 3.7.1 milestone Jan 6, 2026
Contributor

@yiliu30 yiliu30 left a comment


LGTM

Contributor

@thuang6 thuang6 left a comment


Why does the "[SW-239679] temporary fix for static quant test (#298)" commit have no file change?

@xin3he
Contributor Author

xin3he commented Jan 8, 2026

Why does the "[SW-239679] temporary fix for static quant test (#298)" commit have no file change?

https://github.com/habana-internal/neural-compressor-fork/pull/298
Thanks for raising that. The cherry-pick is empty because that fix is already in INC, and the other file, test_autoround.py, has been moved to another place.
The cherry-pick is not finished yet. Please expect more changes; hopefully we can enable test_autoround.py again.

Signed-off-by: xinhe3 <xinhe3@habana.ai>
@xin3he xin3he force-pushed the xinhe/cherry-pick-v1.23.0 branch from d747226 to ef55a76 Compare January 8, 2026 05:39
Signed-off-by: xinhe3 <xinhe3@habana.ai>
@xin3he xin3he force-pushed the xinhe/cherry-pick-v1.23.0 branch 2 times, most recently from 553ac4d to 4d2932c Compare January 9, 2026 06:06
Signed-off-by: xinhe3 <xinhe3@habana.ai>
@xin3he xin3he force-pushed the xinhe/cherry-pick-v1.23.0 branch from 4627a5e to c0e43d7 Compare January 9, 2026 14:43