Conversation

xin3he (Contributor) commented Oct 29, 2025

Related issue:

Env:

4 × 24GB Intel GPU cards

Llama 8B MXFP4 (w/o torch.compile)

auto-round --model meta-llama/Llama-3.1-8B-Instruct --scheme MXFP4 --device 0,1,2,3 --low_gpu_mem_usage

Auto device map result:

{'self_attn.q_proj': 'xpu:2', 'self_attn.k_proj': 'xpu:2', 'self_attn.v_proj': 'xpu:1', 'self_attn.o_proj': 'xpu:1', 'mlp.gate_proj': 'xpu:1', 'mlp.up_proj': 'xpu:2', 'mlp.down_proj': 'xpu:3'}

card_idx: max_memory_used

0: 19G; 1: 16G; 2: 13G; 3: 21G

Llama 8B W4A16 (w/o torch.compile)

auto-round --model meta-llama/Llama-3.1-8B-Instruct --scheme W4A16 --device 0,1 --low_gpu_mem_usage

Auto device map result:

{'self_attn.q_proj': 'xpu:1', 'self_attn.k_proj': 'xpu:0', 'self_attn.v_proj': 'xpu:0', 'self_attn.o_proj': 'xpu:1', 'mlp.gate_proj': 'xpu:1', 'mlp.up_proj': 'xpu:1', 'mlp.down_proj': 'xpu:1'}

card_idx: max_memory_used

0: 19G; 1: 8G

Device map logic

estimate_tuning_block_mem function output:

  • param_memory includes the input activation when enable_act_quant=True
  • output_memory = pick_samples * seq_len * out_features * element_size
  • All memory estimates account for gradients by multiplying by 2 (see the sketch below).
layer_memory_dict = {
    'q_proj': {'param_memory': 0.018, 'output_memory': 0.024},
    'k_proj': {'param_memory': 0.018, 'output_memory': 0.024}, 
    'v_proj': {'param_memory': 0.018, 'output_memory': 0.024},
    'o_proj': {'param_memory': 0.018, 'output_memory': 0.008}
}  
input_output_memory = 4  # GB
additional_memory = 1  # GB (or 19 for XPU)

card_0_left_memory

card_0_left_memory = device_0_memory - input_output_memory - additional_memory - total_block_output_memory
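
A minimal sketch of this estimate in Python, with illustrative numbers; the helper names below mirror the description above and are not the exact estimate_tuning_block_mem implementation:

GB = 1024**3

def output_memory_gb(pick_samples, seq_len, out_features, element_size=2):
    # output_memory = pick_samples * seq_len * out_features * element_size,
    # doubled because gradients of the same size are kept during tuning.
    return 2 * pick_samples * seq_len * out_features * element_size / GB

def card_0_left_memory_gb(device_0_memory, total_block_output_memory,
                          input_output_memory=4.0, additional_memory=19.0):
    # Memory left on card 0 after reserving room for block inputs/outputs and
    # the fixed per-device overhead (1 GB on CUDA, 19 GB assumed for XPU).
    return (device_0_memory - input_output_memory
            - additional_memory - total_block_output_memory)

# e.g. a 24 GB XPU card with 1 GB of total block output memory:
# card_0_left_memory_gb(24.0, 1.0) -> 0.0 (illustrative numbers only)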

_allocate_layers_to_devices function:

Example:
    Input:
        device_memory = {"cuda:0": 30.0, "cuda:1": 40.0, "cuda:2": 40.0}
        layer_memory_dict = {
            "q_proj": {"param_memory": 4.0}, "k_proj": {"param_memory": 1.0},
            "v_proj": {"param_memory": 1.0}, "o_proj": {"param_memory": 4.0},
            "gate_proj": {"param_memory": 11.0}, "up_proj": {"param_memory": 11.0},
            "down_proj": {"param_memory": 11.0}
        }
        mem_per_param = 2.0

    Result (allocation order by size):
        1. gate_proj (22GB) -> cuda:2 (largest, prefer last device)
        2. up_proj (22GB) -> cuda:1 (2nd largest, prefer 2nd last device)
        3. down_proj (22GB) -> cuda:0 (3rd largest, cuda:0 has 30GB available)
        4. q_proj (8GB) -> cuda:2 (neighbor of gate_proj, continuity bonus)
        5. o_proj (8GB) -> cuda:2 (neighbor of q_proj, continuity bonus)
        6. k_proj (2GB) -> cuda:1 (neighbor of q_proj via original order)
        7. v_proj (2GB) -> cuda:1 (neighbor of k_proj, continuity bonus)
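
A simplified greedy sketch of this allocation idea: largest layers first, preferring the device with the most free memory, plus a bonus for sitting next to an already-placed neighbor. The bonus value and tie-breaking here are assumptions chosen so the toy input reproduces the table above; the real _allocate_layers_to_devices scoring differs in detail.

def allocate_layers(device_memory, layer_memory_dict, mem_per_param):
    free = dict(device_memory)       # GB still available per device
    order = list(layer_memory_dict)  # layers in original (model) order
    placement = {}
    # Place the largest layers first so they land on the roomiest devices.
    for name in sorted(order, key=lambda n: layer_memory_dict[n]["param_memory"], reverse=True):
        need = layer_memory_dict[name]["param_memory"] * mem_per_param
        idx = order.index(name)
        neighbors = {order[i] for i in (idx - 1, idx + 1) if 0 <= i < len(order)}

        def score(dev):
            # Prefer devices with more room; add a continuity bonus when a
            # neighboring layer already lives there. Ties go to the later
            # device name, mimicking the "prefer last device" behavior.
            bonus = 10.0 if any(placement.get(n) == dev for n in neighbors) else 0.0
            return (free[dev] + bonus, dev)

        best = max((d for d in free if free[d] >= need), key=score, default=None)
        if best is None:
            raise RuntimeError(f"not enough device memory for {name}")
        placement[name] = best
        free[best] -= need
    return placement

allocate_layers(
    {"cuda:0": 30.0, "cuda:1": 40.0, "cuda:2": 40.0},
    {"q_proj": {"param_memory": 4.0}, "k_proj": {"param_memory": 1.0},
     "v_proj": {"param_memory": 1.0}, "o_proj": {"param_memory": 4.0},
     "gate_proj": {"param_memory": 11.0}, "up_proj": {"param_memory": 11.0},
     "down_proj": {"param_memory": 11.0}},
    mem_per_param=2.0,
)
# -> gate_proj/q_proj/o_proj on cuda:2, up_proj/k_proj/v_proj on cuda:1,
#    down_proj on cuda:0 (same placement as the example above)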

Signed-off-by: He, Xin3 <xin3.he@intel.com>
xin3he (Contributor, Author) commented Oct 29, 2025

FYI. 70B is still OOM with 4 cards due to intel/torch-xpu-ops#2232

wenhuach21 (Contributor) commented Oct 29, 2025

FYI. 70B is still OOM with 4 cards due to intel/torch-xpu-ops#2232

70B on CUDA costs less than 50G; we may do a deep dive later. Could you manually set the device_map yourself to move fewer parameters onto your OOM card?

yiliu30 (Contributor) commented Oct 29, 2025

Llama 8B W4A16 (w/o torch.compile)
auto-round --model meta-llama/Llama-3.1-8B-Instruct --scheme W4A16 --device 0,1 --low_gpu_mem_usage

Is it possible to enable torch.compile?

Signed-off-by: He, Xin3 <xin3.he@intel.com>
xin3he (Contributor, Author) commented Oct 30, 2025

Thank you, @wenhuach21 & @yiliu30
To address your concerns, I updated as below:

  • cuda_xpu_device -> gpu_devices
  • The v tensor is now included in param memory.
  • self.pick_samples = self.batch_size * self.gradient_accumulate_steps
  • SDPA (attention activation) memory is now considered: 1GB for CUDA and 19GB for XPU (see the sketch below).
  • For 70B, the OOM is due to the SDPA issue; the 19GB is assumed from 8B and could be larger for 70B because of the bigger hidden dim. No quantized layers are on xpu:0.
  • torch.compile could reduce the memory usage; I didn't attach the numbers, but it works well.
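
A hedged sketch of the reservation described in the list above; the helper name is hypothetical, only the 1GB/19GB values and the batch-size relationship come from this PR:

def sdpa_reserved_memory_gb(device: str) -> float:
    # SDPA (attention activation) scratch memory reserved per device:
    # ~1 GB on CUDA, ~19 GB observed on XPU for Llama 8B. Models with a
    # bigger hidden dim may need more (see the 70B OOM note above).
    return 19.0 if device.startswith("xpu") else 1.0

# pick_samples is the number of samples seen per optimizer step:
# self.pick_samples = self.batch_size * self.gradient_accumulate_steps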

@xin3he xin3he requested review from wenhuach21 and yiliu30 October 30, 2025 02:22
wenhuach21 (Contributor) commented:
Thank you, @wenhuach21 & @yiliu30 To address your concerns, I updated as below:

  • cuda_xpu_device -> gpu_devices
  • The v tensor is now included in param memory.
  • self.pick_samples = self.batch_size * self.gradient_accumulate_steps
  • SDPA (attention activation) memory is now considered: 1GB for CUDA and 19GB for XPU.
  • For 70B, the OOM is due to the SDPA issue; the 19GB is assumed from 8B and could be larger for 70B because of the bigger hidden dim. No quantized layers are on xpu:0.
  • torch.compile could reduce the memory usage; I didn't attach the numbers, but it works well.

In tuning, we typically use micro batch size and global batch size, so pick_samples -> global_batch_size?

Signed-off-by: He, Xin3 <xin3.he@intel.com>
xin3he (Contributor, Author) commented Oct 30, 2025

@wenhuach21 Agreed.

Signed-off-by: He, Xin3 <xin3.he@intel.com>
xin3he (Contributor, Author) commented Oct 30, 2025

New changes:

  • Observed the real memory usage on CUDA, and roughly estimated the additional activation memory from SiLU and Norm by doubling the layer output memory (see the sketch after this list).
  • Pass output_device into set_auto_device_map_for_block_with_tuning.
  • Add self.low_gpu_mem_usage as a constraint.
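
A minimal sketch of that doubling heuristic, under my reading of the change and assuming the per-layer output_memory values computed by estimate_tuning_block_mem; not the exact implementation:

def extra_activation_memory_gb(layer_memory_dict):
    # SiLU and Norm produce intermediates of roughly the same shape as the
    # layer outputs, so the extra activation memory is approximated by
    # doubling the summed output memory of the block's layers.
    block_output_memory = sum(m["output_memory"] for m in layer_memory_dict.values())
    return 2 * block_output_memory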

xin3he (Contributor, Author) commented Oct 30, 2025

cuda:0 memory used when no linear layers are placed on cuda:0:

llama 8B W4A16

  • about 6GB on CUDA

Qwen3 8B

  • about 6GB on CUDA

Qwen3 32B

  • about 10GB on CUDA

Now the estimate matches the observation.

xin3he (Contributor, Author) commented Oct 30, 2025

BTW, by setting output_device="cpu", only the quantized layers are kept on XPU.
8B can work with less than 10GB memory usage, but the quantization time increases from about 3min to 17min (rough estimate).
70B can also work; I did not measure its time.
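
A hedged illustration of that mode; only the function name and the output_device parameter appear in this PR, the other argument names and the call shape are hypothetical:

# Keep block inputs/outputs on CPU so that only the quantized layers occupy
# XPU memory: much smaller footprint (<10GB for 8B) but slower tuning
# (8B: roughly 3min -> 17min).
device_map = set_auto_device_map_for_block_with_tuning(
    block,                            # hypothetical: the block being tuned
    device_list=["xpu:0", "xpu:1"],   # hypothetical argument name
    output_device="cpu",
)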

@xin3he xin3he requested a review from wenhuach21 October 30, 2025 07:48
@xin3he xin3he requested a review from wenhuach21 October 30, 2025 08:46
Signed-off-by: He, Xin3 <xin3.he@intel.com>
wenhuach21 (Contributor) left a review comment:

Please double-check.
Run the default command (auto-round --model xxx) for the 8B model to check whether the speed and accuracy are roughly the same as on the main branch.

Signed-off-by: He, Xin3 <xin3.he@intel.com>
@xin3he xin3he requested review from wenhuach21 and yiliu30 October 31, 2025 08:40
Signed-off-by: He, Xin3 <xin3.he@intel.com>
@xin3he xin3he requested a review from wenhuach21 October 31, 2025 10:38
@xin3he xin3he merged commit 06dd686 into main Oct 31, 2025
23 checks passed
@xin3he xin3he deleted the xinhe/fix branch October 31, 2025 12:46
chensuyue added a commit that referenced this pull request Nov 11, 2025
* Fix rtn tuning_device issue (#893)

Signed-off-by: Kaihui-intel <kaihui.tang@intel.com>

* fix vlm gguf ut (#895)

Signed-off-by: n1ck-guo <heng.guo@intel.com>

* update alg_ext.abi3.so with python compatible version (#894)

* move ste from quant to round for nvfp4 (#889)

Signed-off-by: He, Xin3 <xin3.he@intel.com>

* Add GPT-OSS quant support (#887)

* better help printing information (#883)

* better help printing information

Signed-off-by: n1ck-guo <heng.guo@intel.com>

* speedup quant and evaluation, fix recompile issue (#897)

* rewrite the implementation for ease-of-maintain

Signed-off-by: He, Xin3 <xin3.he@intel.com>

* fix bug

Signed-off-by: He, Xin3 <xin3.he@intel.com>

* fix quant performance

Signed-off-by: He, Xin3 <xin3.he@intel.com>

* Update auto_round/compressors/base.py

---------

Signed-off-by: He, Xin3 <xin3.he@intel.com>

* fix nvfp act quantization bug (#891)

* fix nvfp act quantization bug

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* add cuda ut for moe nvfp quantize

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* add cpu UT, refine cuda UT

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix ut typo

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix cpu ut

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* enhance experts amax match, refine UT

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* support automatic mixed bits assignment (#851)

* try to fix gguf issue (#886)

* remove numba from requirments (#905)

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* Extend mxfp loading dtypes (#907)

* block dataset logger info (#908)

Signed-off-by: n1ck-guo <heng.guo@intel.com>

* fix torch compile issue in AutoScheme (#909)

* Revert "Extend mxfp loading dtypes (#907)" (#915)

This reverts commit 0c2619c.

* support disable_opt_rtn in auto-scheme (#913)

* fix llama 4 ut (#896)

* fix ut of llama 4

Signed-off-by: n1ck-guo <heng.guo@intel.com>

* add numba for cpu lib (#919)

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* Loosen the packing restrictions for mxfp&nvfp (#911)

* Loosen the packing restrictions for mxfp&nvfp, enable Qwen1.5-MoE-A2.7B quantize

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix UT

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refine mxfp&nvfp layer checker

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* fix pylint

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Extend mxfp loading dtypes (#916)

Signed-off-by: root <root@clx5673.ra.intel.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Co-authored-by: root <root@clx5673.ra.intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix act config exporting for mixed schemes (#903)

* fp8 exporting bugfix

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* fix act related config saving

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add ut for act_config check

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refine extra_config saving, add UTs

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* fix ut typo

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* fix ut typo

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* fixtypo

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix CI

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* fix scan issue

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* fix scan issue

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* rm global variable

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rerun ut

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refine ut

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* optimize rtn for int woq (#924)

* fix bug of gguf and support for LiquidAI/LFM2-1.2B (#927)

Signed-off-by: n1ck-guo <heng.guo@intel.com>

* remove numpy<2.0 limitation (#921)

* enable regex quantization config saving for mixed bits (#825)

* enable dynamic quantization config saving

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixtypo

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rebase code, refine config saving

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refine ut

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* fix UT

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* enable hf loading for regex, add UTs

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refine export, enhance gptq UT

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix Flux tuning issue (#936)

Signed-off-by: Mengni Wang <mengni.wang@intel.com>

* gguf support for inclusionAI/Ling-flash-2.0 (#940)

* remove low_cpu_mem (#934)

* Add compatibility test (#918)

* Add commit hash to version (#941)

Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com>

* gguf weight type align with original, output.weight, token_embed (#900)

* support attention mask in user's dataset (#930)

* Add diffusion README (#923)

* update readme (#949)

* refactor utils file (#943)

* refact utils

Signed-off-by: n1ck-guo <heng.guo@intel.com>

* update readme for sglang support (#953)

* update readme for sglang support

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* refine doc

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* Update README.md

---------

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
Co-authored-by: Wenhua Cheng <wenhua.cheng@intel.com>

* update gguf and support for CompressedLinear (#950)

* Reduce AutoSchem VRAM usage by up to 10X (#944)

* add self attribution and fix avg_bits error (#956)

* add self attribution and fix avg_bits error
---------

Signed-off-by: He, Xin3 <xin3.he@intel.com>
Co-authored-by: Wenhua Cheng <wenhua.cheng@intel.com>

* add logo (#960)

* refine AutoScheme readme/code (#958)

* update readme (#962)

* fix critic disable_opt_rtn regression (#963)

* [1/N] Initial vllm-ext evaluation support (MXFP4 MOE) (#935)

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* fix bug of imatrix contains 0 (#955)

* fix rtn bug (#966)

* enhance flux doc (#967)

* clean code (#968)

* support for model scope  (#957)

* support for model scope

Signed-off-by: n1ck-guo <heng.guo@intel.com>

* merge main branch to alg_ext (#970)

* fix cuda CI backend issue, fixtypo (#974)

* disable compile packing by default (#975)

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* enhance auto device map and support XPU  (#961)

* enhance auto device map and support XPU
---------

Signed-off-by: He, Xin3 <xin3.he@intel.com>

* refine readme (#978)

* cli support for positional arguments model (#979)

Signed-off-by: n1ck-guo <heng.guo@intel.com>

* update bits (#986)

Signed-off-by: He, Xin3 <xin3.he@intel.com>

* fix guff scheme and device_map bug (#969)

* add support for Magistral-Small (#980)

* support model_dtype and fix bug of scheme contains quotes, mllm eval (#985)

---------

Signed-off-by: Kaihui-intel <kaihui.tang@intel.com>
Signed-off-by: n1ck-guo <heng.guo@intel.com>
Signed-off-by: He, Xin3 <xin3.he@intel.com>
Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Signed-off-by: root <root@clx5673.ra.intel.com>
Signed-off-by: Mengni Wang <mengni.wang@intel.com>
Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com>
Co-authored-by: Tang Kaihui <kaihui.tang@intel.com>
Co-authored-by: Heng Guo <heng.guo@intel.com>
Co-authored-by: Xin He <xin3.he@intel.com>
Co-authored-by: Yi Liu <yi4.liu@intel.com>
Co-authored-by: Weiwei <weiwei1.zhang@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Wenhua Cheng <wenhua.cheng@intel.com>
Co-authored-by: root <root@clx5673.ra.intel.com>
Co-authored-by: Wang, Mengni <mengni.wang@intel.com>
Co-authored-by: Sun, Xuehao <xuehao.sun@intel.com>