Deploying GLM 5.1 on H200 #2019

@OldSixOne

Reminder

  • I have read the above rules and searched the existing issues.

System Info

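I am deploying ktransformers on an offline server with Docker, using the image approachingai/ktransformers:v0.5.3. The server has two H200 GPUs and 386 GB of RAM. I am trying to deploy GLM 5.1 and will add more memory if that is not enough, but I could not get it to run. This is the command I ran: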
export PYTORCH_ALLOC_CONF=expandable_segments:True
export SGLANG_ENABLE_JIT_DEEPGEMM=0

python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 30000 \
  --model /code/GLM-5.1-FP8 \
  --kt-weight-path /code/GLM-5.1-FP8 \
  --kt-cpuinfer 96 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 72 \
  --kt-method FP8 \
  --kt-gpu-prefill-token-threshold 1024 \
  --kt-enable-dynamic-expert-update \
  --kt-expert-placement-strategy uniform \
  --trust-remote-code \
  --mem-fraction-static 0.85 \
  --served-model-name GLM5.1 \
  --enable-mixed-chunk \
  --tensor-parallel-size 2 \
  --enable-p2p-check \
  --disable-shared-experts-fusion \
  --chunked-prefill-size 16384 \
  --max-running-requests 4 \
  --max-total-tokens 128000 \
  --attention-backend flashinfer \
  --kv-cache-dtype bf16 \
  --fp8-gemm-backend cutlass \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --watchdog-timeout 30000
After running the command, the following is printed:
===========In NumaPool============
In Numa Worker Pool at NUMA 0, 48 threads
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
===========In NumaPool============
In Numa Worker Pool at NUMA 0, 48 threads
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
None of these messages is reported as an error, though.
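The `NUMA node 1 not found` lines suggest that the container only sees NUMA node 0, while the command requests two thread pools. Below is a minimal sketch for checking this, assuming a Linux system that exposes `/sys/devices/system/node/online`. Whether `--kt-threadpool-count` has to match the number of visible NUMA nodes is my assumption; the logs do not confirm it. The helper names `parse_node_list` and `online_numa_nodes` are made up for this illustration.

```python
from pathlib import Path


def parse_node_list(text: str) -> list[int]:
    """Expand a sysfs list such as "0" or "0-1,3" into a list of node IDs."""
    nodes: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            nodes.extend(range(int(low), int(high) + 1))
        else:
            nodes.append(int(part))
    return nodes


def online_numa_nodes(sysfs_path: str = "/sys/devices/system/node/online") -> list[int]:
    """Return the NUMA nodes this process can see, or [] if sysfs has no entry."""
    path = Path(sysfs_path)
    if not path.exists():
        return []
    return parse_node_list(path.read_text())


if __name__ == "__main__":
    nodes = online_numa_nodes()
    print(f"online NUMA nodes: {nodes}")
    # The launch command above passes --kt-threadpool-count 2.
    print(f"visible node count: {len(nodes)}")
```

Running this inside the container shows how many nodes are visible there, which can then be compared with the value passed to `--kt-threadpool-count`.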
The actual error output follows:
[NativeMoEWrapper Layer 42] load_experts: 298.6ms, prepare_tensors: 0.0ms, build_ptrs: 0.4ms, create_moe: 9.6ms, cpp_load_weights: 5712.6ms, cleanup: 0.4ms, total: 6021.7ms
TP MOE layer 43, pool: 0x64386eb51e40, expert num: 256, num_experts_per_tok: 8
Created AMX_FP8_MOE_TP 0 at numa 0
alloc 1 from other numa for 7a66e7c047d0
Created AMX_FP8_MOE_TP 1 at numa 0
[rank1]:[E521 08:23:52.969958665 ProcessGroupGloo.cpp:71] Rank 1 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[2026-05-21 08:23:52] Rank 0 scheduler is dead. Please check if there are relevant logs.
[rank1]:[W521 08:23:52.987245866 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=57, addr=[localhost]:46838, remote=[localhost]:56731): Connection reset by peer
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:679 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7c84e8570b80 in /opt/miniconda3/envs/serve/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ffc5b1 (0x7c852a8cd5b1 in /opt/miniconda3/envs/serve/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5ffda13 (0x7c852a8cea13 in /opt/miniconda3/envs/serve/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5ffe55a (0x7c852a8cf55a in /opt/miniconda3/envs/serve/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x31e (0x7c852a8ca27e in /opt/miniconda3/envs/serve/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c8 (0x7c84e9449868 in /opt/miniconda3/envs/serve/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdf0e6 (0x7c86c74b40e6 in /opt/miniconda3/envs/serve/bin/../lib/libstdc++.so.6)
frame #7: <unknown function> + 0x9caa4 (0x7c86ca077aa4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #8: __clone + 0x44 (0x7c86ca104a64 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[W521 08:23:52.001241019 ProcessGroupNCCL.cpp:1771] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Connection reset by peer
[2026-05-21 08:23:52 TP1] Scheduler hit an exception: Traceback (most recent call last):
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/managers/scheduler.py", line 3118, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/managers/scheduler.py", line 363, in __init__
    self.init_model_worker()
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/managers/scheduler.py", line 559, in init_model_worker
    self.init_tp_model_worker()
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/managers/scheduler.py", line 517, in init_tp_model_worker
    self.tp_worker = TpModelWorker(
                     ^^^^^^^^^^^^^^
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/managers/tp_worker.py", line 247, in __init__
    self._init_model_runner()
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/managers/tp_worker.py", line 330, in _init_model_runner
    self._model_runner = ModelRunner(
                         ^^^^^^^^^^^^
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/model_executor/model_runner.py", line 415, in __init__
    self.initialize(min_per_gpu_memory)
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/model_executor/model_runner.py", line 495, in initialize
    self.load_model()
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/model_executor/model_runner.py", line 1073, in load_model
    raise ValueError(
ValueError: TP rank 1 could finish the model loading, but there are other ranks that didn't finish loading. It is likely due to unexpected failures (e.g., OOM) or a slow node.

[2026-05-21 08:23:52] Received sigquit from a child process. It usually means the child failed.
glm_51.sh: line 32: 4371 Killed python -m sglang.launch_server --host 0.0.0.0 --port 30000 --model /code/GLM-5.1-FP8 --kt-weight-path /code/GLM-5.1-FP8 --kt-cpuinfer 96 --kt-threadpool-count 2 --kt-num-gpu-experts 72 --kt-method FP8 --kt-gpu-prefill-token-threshold 1024 --kt-enable-dynamic-expert-update --kt-expert-placement-strategy uniform --trust-remote-code --mem-fraction-static 0.85 --served-model-name GLM5.1 --enable-mixed-chunk --tensor-parallel-size 2 --enable-p2p-check --disable-shared-experts-fusion --chunked-prefill-size 16384 --max-running-requests 4 --max-total-tokens 128000 --attention-backend flashinfer --kv-cache-dtype bf16 --fp8-gemm-backend cutlass --tool-call-parser glm47 --reasoning-parser glm45 --watchdog-timeout 30000
I don't know what to do next. Please help.
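The `ValueError` names OOM as a likely cause and the shell reports the process as `Killed`, so one thing worth checking is whether the weights fit in host RAM at all. Below is a rough sketch that compares the on-disk size of the weight files with `MemTotal` and `MemAvailable` from `/proc/meminfo`. It assumes the weights are stored as `*.safetensors` files directly under `/code/GLM-5.1-FP8`, which I have not verified, and it ignores the share of experts placed on the GPUs and any Docker memory limit. The helper names are made up for this illustration.

```python
from pathlib import Path


def weights_size_bytes(model_dir: str, pattern: str = "*.safetensors") -> int:
    """Sum the sizes of weight files in model_dir (file pattern is an assumption)."""
    return sum(p.stat().st_size for p in Path(model_dir).glob(pattern))


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo text into {field: value}, converting kB values to bytes."""
    info: dict[str, int] = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if not fields:
            continue
        value = int(fields[0])
        if len(fields) > 1 and fields[1] == "kB":
            value *= 1024
        info[key.strip()] = value
    return info


if __name__ == "__main__":
    gib = 1024 ** 3
    weights = weights_size_bytes("/code/GLM-5.1-FP8")
    mem = parse_meminfo(Path("/proc/meminfo").read_text())
    print(f"weights on disk: {weights / gib:.1f} GiB")
    print(f"MemTotal:        {mem.get('MemTotal', 0) / gib:.1f} GiB")
    print(f"MemAvailable:    {mem.get('MemAvailable', 0) / gib:.1f} GiB")
```

If the weights on disk are close to or larger than the available memory, the `Killed` message would be consistent with the kernel's OOM killer, but this comparison is only a coarse indicator.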

Reproduction

Put your message here.

Others

No response
