Conversation

xin3he (Contributor) commented Oct 29, 2025

Related issue:

Env:

4 × 24GB Intel GPU cards

Llama 8B MXFP4 (w/o torch.compile)

auto-round --model meta-llama/Llama-3.1-8B-Instruct --scheme MXFP4 --device 0,1,2,3 --low_gpu_mem_usage

Auto device map result:

{'self_attn.q_proj': 'xpu:2', 'self_attn.k_proj': 'xpu:2', 'self_attn.v_proj': 'xpu:1', 'self_attn.o_proj': 'xpu:1', 'mlp.gate_proj': 'xpu:1', 'mlp.up_proj': 'xpu:2', 'mlp.down_proj': 'xpu:3'}

card_idx: max_memory_used

0: 19G; 1: 16G; 2: 13G; 3: 21G

Llama 8B W4A16 (w/o torch.compile)

auto-round --model meta-llama/Llama-3.1-8B-Instruct --scheme W4A16 --device 0,1 --low_gpu_mem_usage

Auto device map result:

{'self_attn.q_proj': 'xpu:1', 'self_attn.k_proj': 'xpu:0', 'self_attn.v_proj': 'xpu:0', 'self_attn.o_proj': 'xpu:1', 'mlp.gate_proj': 'xpu:1', 'mlp.up_proj': 'xpu:1', 'mlp.down_proj': 'xpu:1'}

card_idx: max_memory_used

0: 19G; 1: 8G

Device map logic

estimate_tuning_block_mem function output:

  • param_memory includes the input activation when enable_act_quant=True
  • output_memory = pick_samples * seq_len * out_features * element_size
  • All memory estimates account for gradients by multiplying by 2 (see the sketch below).
layer_memory_dict = {
    'q_proj': {'param_memory': 0.018, 'output_memory': 0.024},
    'k_proj': {'param_memory': 0.018, 'output_memory': 0.024}, 
    'v_proj': {'param_memory': 0.018, 'output_memory': 0.024},
    'o_proj': {'param_memory': 0.018, 'output_memory': 0.008}
}  
input_output_memory = 4  # GB
additional_memory = 1  # GB (or 19 for XPU)

card_0_left_memory

card_0_left_memory = device_0_memory - input_output_memory - additional_memory - total_block_output_memory
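
A minimal sketch of this estimate in Python, with illustrative numbers; the helper names below mirror the description above and are not the exact estimate_tuning_block_mem implementation:

GB = 1024**3

def output_memory_gb(pick_samples, seq_len, out_features, element_size=2):
    # output_memory = pick_samples * seq_len * out_features * element_size,
    # doubled because gradients of the same size are kept during tuning.
    return 2 * pick_samples * seq_len * out_features * element_size / GB

def card_0_left_memory_gb(device_0_memory, total_block_output_memory,
                          input_output_memory=4.0, additional_memory=19.0):
    # Memory left on card 0 after reserving room for block inputs/outputs and
    # the fixed per-device overhead (1 GB on CUDA, 19 GB assumed for XPU).
    return (device_0_memory - input_output_memory
            - additional_memory - total_block_output_memory)

# e.g. a 24 GB XPU card with 1 GB of total block output memory:
# card_0_left_memory_gb(24.0, 1.0) -> 0.0 (illustrative numbers only)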

_allocate_layers_to_devices function:

Example:
    Input:
        device_memory = {"cuda:0": 30.0, "cuda:1": 40.0, "cuda:2": 40.0}
        layer_memory_dict = {
            "q_proj": {"param_memory": 4.0}, "k_proj": {"param_memory": 1.0},
            "v_proj": {"param_memory": 1.0}, "o_proj": {"param_memory": 4.0},
            "gate_proj": {"param_memory": 11.0}, "up_proj": {"param_memory": 11.0},
            "down_proj": {"param_memory": 11.0}
        }
        mem_per_param = 2.0

    Result (allocation order by size):
        1. gate_proj (22GB) -> cuda:2 (largest, prefer last device)
        2. up_proj (22GB) -> cuda:1 (2nd largest, prefer 2nd last device)
        3. down_proj (22GB) -> cuda:0 (3rd largest, cuda:0 has 30GB available)
        4. q_proj (8GB) -> cuda:2 (neighbor of gate_proj, continuity bonus)
        5. o_proj (8GB) -> cuda:2 (neighbor of q_proj, continuity bonus)
        6. k_proj (2GB) -> cuda:1 (neighbor of q_proj via original order)
        7. v_proj (2GB) -> cuda:1 (neighbor of k_proj, continuity bonus)
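
A simplified greedy sketch of this allocation idea: largest layers first, preferring the device with the most free memory, plus a bonus for sitting next to an already-placed neighbor. The bonus value and tie-breaking here are assumptions chosen so the toy input reproduces the table above; the real _allocate_layers_to_devices scoring differs in detail.

def allocate_layers(device_memory, layer_memory_dict, mem_per_param):
    free = dict(device_memory)       # GB still available per device
    order = list(layer_memory_dict)  # layers in original (model) order
    placement = {}
    # Place the largest layers first so they land on the roomiest devices.
    for name in sorted(order, key=lambda n: layer_memory_dict[n]["param_memory"], reverse=True):
        need = layer_memory_dict[name]["param_memory"] * mem_per_param
        idx = order.index(name)
        neighbors = {order[i] for i in (idx - 1, idx + 1) if 0 <= i < len(order)}

        def score(dev):
            # Prefer devices with more room; add a continuity bonus when a
            # neighboring layer already lives there. Ties go to the later
            # device name, mimicking the "prefer last device" behavior.
            bonus = 10.0 if any(placement.get(n) == dev for n in neighbors) else 0.0
            return (free[dev] + bonus, dev)

        best = max((d for d in free if free[d] >= need), key=score, default=None)
        if best is None:
            raise RuntimeError(f"not enough device memory for {name}")
        placement[name] = best
        free[best] -= need
    return placement

allocate_layers(
    {"cuda:0": 30.0, "cuda:1": 40.0, "cuda:2": 40.0},
    {"q_proj": {"param_memory": 4.0}, "k_proj": {"param_memory": 1.0},
     "v_proj": {"param_memory": 1.0}, "o_proj": {"param_memory": 4.0},
     "gate_proj": {"param_memory": 11.0}, "up_proj": {"param_memory": 11.0},
     "down_proj": {"param_memory": 11.0}},
    mem_per_param=2.0,
)
# -> gate_proj/q_proj/o_proj on cuda:2, up_proj/k_proj/v_proj on cuda:1,
#    down_proj on cuda:0 (same placement as the example above)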

Signed-off-by: He, Xin3 <xin3.he@intel.com>
xin3he (Contributor, Author) commented Oct 29, 2025

FYI. 70B is still OOM with 4 cards due to intel/torch-xpu-ops#2232

wenhuach21 (Contributor) commented Oct 29, 2025

FYI. 70B is still OOM with 4 cards due to intel/torch-xpu-ops#2232

70B on CUDA costs less than 50G; we may do a deep dive later. Could you manually set the device_map yourself to move fewer parameters onto your OOM card?

yiliu30 (Contributor) commented Oct 29, 2025

Llama 8B W4A16 (w/o torch.compile)
auto-round --model meta-llama/Llama-3.1-8B-Instruct --scheme W4A16 --device 0,1 --low_gpu_mem_usage

Is it possible to enable torch.compile?

Signed-off-by: He, Xin3 <xin3.he@intel.com>
xin3he (Contributor, Author) commented Oct 30, 2025

Thank you, @wenhuach21 & @yiliu30
To address your concerns, I updated as below:

  • cuda_xpu_device -> gpu_devices
  • The v tensor is now included in param memory.
  • self.pick_samples = self.batch_size * self.gradient_accumulate_steps
  • SDPA (attention activation) memory is now considered: 1GB for CUDA and 19GB for XPU (see the sketch below).
  • For 70B, the OOM is due to the SDPA issue; the 19GB is assumed from 8B and could be larger for 70B because of the bigger hidden dim. No quantized layers are on xpu:0.
  • torch.compile could reduce the memory usage; I didn't attach the numbers, but it works well.
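
A hedged sketch of the reservation described in the list above; the helper name is hypothetical, only the 1GB/19GB values and the batch-size relationship come from this PR:

def sdpa_reserved_memory_gb(device: str) -> float:
    # SDPA (attention activation) scratch memory reserved per device:
    # ~1 GB on CUDA, ~19 GB observed on XPU for Llama 8B. Models with a
    # bigger hidden dim may need more (see the 70B OOM note above).
    return 19.0 if device.startswith("xpu") else 1.0

# pick_samples is the number of samples seen per optimizer step:
# self.pick_samples = self.batch_size * self.gradient_accumulate_steps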

@xin3he xin3he requested review from wenhuach21 and yiliu30 October 30, 2025 02:22
wenhuach21 (Contributor) commented:
Thank you, @wenhuach21 & @yiliu30 To address your concerns, I updated as below:

  • cuda_xpu_device -> gpu_devices
  • The v tensor is now included in param memory.
  • self.pick_samples = self.batch_size * self.gradient_accumulate_steps
  • SDPA (attention activation) memory is now considered: 1GB for CUDA and 19GB for XPU.
  • For 70B, the OOM is due to the SDPA issue; the 19GB is assumed from 8B and could be larger for 70B because of the bigger hidden dim. No quantized layers are on xpu:0.
  • torch.compile could reduce the memory usage; I didn't attach the numbers, but it works well.

In tuning, we typically use micro batch size and global batch size, so pick_samples -> global_batch_size?

Signed-off-by: He, Xin3 <xin3.he@intel.com>
xin3he (Contributor, Author) commented Oct 30, 2025

@wenhuach21 Agreed.

Signed-off-by: He, Xin3 <xin3.he@intel.com>
xin3he (Contributor, Author) commented Oct 30, 2025

New changes:

  • Observed the real memory usage on CUDA, and roughly estimated the additional activation memory from SiLU and Norm by doubling the layer output memory (see the sketch after this list).
  • Pass output_device into set_auto_device_map_for_block_with_tuning.
  • Add self.low_gpu_mem_usage as a constraint.
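
A minimal sketch of that doubling heuristic, under my reading of the change and assuming the per-layer output_memory values computed by estimate_tuning_block_mem; not the exact implementation:

def extra_activation_memory_gb(layer_memory_dict):
    # SiLU and Norm produce intermediates of roughly the same shape as the
    # layer outputs, so the extra activation memory is approximated by
    # doubling the summed output memory of the block's layers.
    block_output_memory = sum(m["output_memory"] for m in layer_memory_dict.values())
    return 2 * block_output_memory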

xin3he (Contributor, Author) commented Oct 30, 2025

cuda:0 memory used when no linear layers are placed on cuda:0:

llama 8B W4A16

  • about 6GB on CUDA

Qwen3 8B

  • about 6GB on CUDA

Qwen3 32B

  • about 10GB on CUDA

Now the estimate matches the observation.

xin3he (Contributor, Author) commented Oct 30, 2025

BTW, by setting output_device="cpu", only the quantized layers are kept on XPU.
8B can work with less than 10GB memory usage, but the quantization time increases from about 3min to 17min (rough estimate).
70B can also work; I did not measure its time.
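
A hedged illustration of that mode; only the function name and the output_device parameter appear in this PR, the other argument names and the call shape are hypothetical:

# Keep block inputs/outputs on CPU so that only the quantized layers occupy
# XPU memory: much smaller footprint (<10GB for 8B) but slower tuning
# (8B: roughly 3min -> 17min).
device_map = set_auto_device_map_for_block_with_tuning(
    block,                            # hypothetical: the block being tuned
    device_list=["xpu:0", "xpu:1"],   # hypothetical argument name
    output_device="cpu",
)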

@xin3he xin3he requested a review from wenhuach21 October 30, 2025 07:48
@xin3he xin3he requested a review from wenhuach21 October 30, 2025 08:46
Signed-off-by: He, Xin3 <xin3.he@intel.com>
wenhuach21 (Contributor) left a review comment:

Please double-check.
Run the default command (auto-round --model xxx) for the 8B model to check whether the speed and accuracy are roughly the same as on the main branch.

Signed-off-by: He, Xin3 <xin3.he@intel.com>
@xin3he xin3he requested review from wenhuach21 and yiliu30 October 31, 2025 08:40
Signed-off-by: He, Xin3 <xin3.he@intel.com>
@xin3he xin3he requested a review from wenhuach21 October 31, 2025 10:38
@xin3he xin3he merged commit 06dd686 into main Oct 31, 2025
23 checks passed
@xin3he xin3he deleted the xinhe/fix branch October 31, 2025 12:46
chensuyue added a commit that referenced this pull request Nov 11, 2025
* Fix rtn tuning_device issue (#893)

Signed-off-by: Kaihui-intel <kaihui.tang@intel.com>

* fix vlm gguf ut (#895)

Signed-off-by: n1ck-guo <heng.guo@intel.com>

* update alg_ext.abi3.so with python compatible version (#894)

* move ste from quant to round for nvfp4 (#889)

Signed-off-by: He, Xin3 <xin3.he@intel.com>

* Add GPT-OSS quant support (#887)

* better help printing information (#883)

* better help printing information

Signed-off-by: n1ck-guo <heng.guo@intel.com>

* speedup quant and evaluation, fix recompile issue (#897)

* rewrite the implementation for ease-of-maintain

Signed-off-by: He, Xin3 <xin3.he@intel.com>

* fix bug

Signed-off-by: He, Xin3 <xin3.he@intel.com>

* fix quant performance

Signed-off-by: He, Xin3 <xin3.he@intel.com>

* Update auto_round/compressors/base.py

---------

Signed-off-by: He, Xin3 <xin3.he@intel.com>

* fix nvfp act quantization bug (#891)

* fix nvfp act quantization bug

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* add cuda ut for moe nvfp quantize

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* add cpu UT, refine cuda UT

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix ut typo

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix cpu ut

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* enhance experts amax match, refine UT

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* support automatic mixed bits assignment (#851)

* try to fix gguf issue (#886)

* remove numba from requirments (#905)

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* Extend mxfp loading dtypes (#907)

* block dataset logger info (#908)

Signed-off-by: n1ck-guo <heng.guo@intel.com>

* fix torch compile issue in AutoScheme (#909)

* Revert "Extend mxfp loading dtypes (#907)" (#915)

This reverts commit 0c2619c.

* support disable_opt_rtn in auto-scheme (#913)

* fix llama 4 ut (#896)

* fix ut of llama 4

Signed-off-by: n1ck-guo <heng.guo@intel.com>

* add numba for cpu lib (#919)

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* Loosen the packing restrictions for mxfp&nvfp (#911)

* Loosen the packing restrictions for mxfp&nvfp, enable Qwen1.5-MoE-A2.7B quantize

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix UT

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refine mxfp&nvfp layer checker

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* fix pylint

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Extend mxfp loading dtypes (#916)

Signed-off-by: root <root@clx5673.ra.intel.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Co-authored-by: root <root@clx5673.ra.intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix act config exporting for mixed schemes (#903)

* fp8 exporting bugfix

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* fix act related config saving

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add ut for act_config check

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refine extra_config saving, add UTs

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* fix ut typo

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* fix ut typo

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* fixtypo

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix CI

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* fix scan issue

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* fix scan issue

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* rm global variable

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rerun ut

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refine ut

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* optimize rtn for int woq (#924)

* fix bug of gguf and support for LiquidAI/LFM2-1.2B (#927)

Signed-off-by: n1ck-guo <heng.guo@intel.com>

* remove numpy<2.0 limitation (#921)

* enable regex quantization config saving for mixed bits (#825)

* enable dynamic quantization config saving

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixtypo

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rebase code, refine config saving

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refine ut

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* fix UT

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* enable hf loading for regex, add UTs

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refine export, enhance gptq UT

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix Flux tuning issue (#936)

Signed-off-by: Mengni Wang <mengni.wang@intel.com>

* gguf support for inclusionAI/Ling-flash-2.0 (#940)

* remove low_cpu_mem (#934)

* Add compatibility test (#918)

* Add commit hash to version (#941)

Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com>

* gguf weight type align with original, output.weight, token_embed (#900)

* support attention mask in user's dataset (#930)

* Add diffusion README (#923)

* update readme (#949)

* refactor utils file (#943)

* refact utils

Signed-off-by: n1ck-guo <heng.guo@intel.com>

* update readme for sglang support (#953)

* update readme for sglang support

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* refine doc

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>

* Update README.md

---------

Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
Co-authored-by: Wenhua Cheng <wenhua.cheng@intel.com>

* update gguf and support for CompressedLinear (#950)

* Reduce AutoSchem VRAM usage by up to 10X (#944)

* add self attribution and fix avg_bits error (#956)

* add self attribution and fix avg_bits error
---------

Signed-off-by: He, Xin3 <xin3.he@intel.com>
Co-authored-by: Wenhua Cheng <wenhua.cheng@intel.com>

* add logo (#960)

* refine AutoScheme readme/code (#958)

* update readme (#962)

* fix critic disable_opt_rtn regression (#963)

* [1/N] Initial vllm-ext evaluation support (MXFP4 MOE) (#935)

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* fix bug of imatrix contains 0 (#955)

* fix rtn bug (#966)

* enhance flux doc (#967)

* clean code (#968)

* support for model scope  (#957)

* support for model scope

Signed-off-by: n1ck-guo <heng.guo@intel.com>

* merge main branch to alg_ext (#970)

* fix cuda CI backend issue, fixtypo (#974)

* disable compile packing by default (#975)

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* enhance auto device map and support XPU  (#961)

* enhance auto device map and support XPU
---------

Signed-off-by: He, Xin3 <xin3.he@intel.com>

* refine readme (#978)

* cli support for positional arguments model (#979)

Signed-off-by: n1ck-guo <heng.guo@intel.com>

* update bits (#986)

Signed-off-by: He, Xin3 <xin3.he@intel.com>

* fix guff scheme and device_map bug (#969)

* add support for Magistral-Small (#980)

* support model_dtype and fix bug of scheme contains quotes, mllm eval (#985)

---------

Signed-off-by: Kaihui-intel <kaihui.tang@intel.com>
Signed-off-by: n1ck-guo <heng.guo@intel.com>
Signed-off-by: He, Xin3 <xin3.he@intel.com>
Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Signed-off-by: root <root@clx5673.ra.intel.com>
Signed-off-by: Mengni Wang <mengni.wang@intel.com>
Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com>
Co-authored-by: Tang Kaihui <kaihui.tang@intel.com>
Co-authored-by: Heng Guo <heng.guo@intel.com>
Co-authored-by: Xin He <xin3.he@intel.com>
Co-authored-by: Yi Liu <yi4.liu@intel.com>
Co-authored-by: Weiwei <weiwei1.zhang@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Wenhua Cheng <wenhua.cheng@intel.com>
Co-authored-by: root <root@clx5673.ra.intel.com>
Co-authored-by: Wang, Mengni <mengni.wang@intel.com>
Co-authored-by: Sun, Xuehao <xuehao.sun@intel.com>