Reminder
System Info
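I am deploying ktransformers on an offline server using Docker; the image is approachingai/ktransformers:v0.5.3. The server has two H200 GPUs and 386G of system memory. I am trying to deploy GLM 5.1 and will add more memory if that turns out to be insufficient, but I could not get it to run. Here is the command I ran: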
export PYTORCH_ALLOC_CONF=expandable_segments:True
export SGLANG_ENABLE_JIT_DEEPGEMM=0
python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 30000 \
  --model /code/GLM-5.1-FP8 \
  --kt-weight-path /code/GLM-5.1-FP8 \
  --kt-cpuinfer 96 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 72 \
  --kt-method FP8 \
  --kt-gpu-prefill-token-threshold 1024 \
  --kt-enable-dynamic-expert-update \
  --kt-expert-placement-strategy uniform \
  --trust-remote-code \
  --mem-fraction-static 0.85 \
  --served-model-name GLM5.1 \
  --enable-mixed-chunk \
  --tensor-parallel-size 2 \
  --enable-p2p-check \
  --disable-shared-experts-fusion \
  --chunked-prefill-size 16384 \
  --max-running-requests 4 \
  --max-total-tokens 128000 \
  --attention-backend flashinfer \
  --kv-cache-dtype bf16 \
  --fp8-gemm-backend cutlass \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --watchdog-timeout 30000
After running the command, the following messages are printed:
===========In NumaPool============
In Numa Worker Pool at NUMA 0, 48 threads
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
===========In NumaPool============
In Numa Worker Pool at NUMA 0, 48 threads
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
NUMA node 1 not found
These messages do not abort the launch.
The actual error output follows:
[NativeMoEWrapper Layer 42] load_experts: 298.6ms, prepare_tensors: 0.0ms, build_ptrs: 0.4ms, create_moe: 9.6ms, cpp_load_weights: 5712.6ms, cleanup: 0.4ms, total: 6021.7ms
TP MOE layer 43, pool: 0x64386eb51e40, expert num: 256, num_experts_per_tok: 8
Created AMX_FP8_MOE_TP 0 at numa 0
alloc 1 from other numa for 7a66e7c047d0
Created AMX_FP8_MOE_TP 1 at numa 0
[rank1]:[E521 08:23:52.969958665 ProcessGroupGloo.cpp:71] Rank 1 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[2026-05-21 08:23:52] Rank 0 scheduler is dead. Please check if there are relevant logs.
[rank1]:[W521 08:23:52.987245866 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=57, addr=[localhost]:46838, remote=[localhost]:56731): Connection reset by peer
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:679 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x7c84e8570b80 in /opt/miniconda3/envs/serve/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: + 0x5ffc5b1 (0x7c852a8cd5b1 in /opt/miniconda3/envs/serve/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0x5ffda13 (0x7c852a8cea13 in /opt/miniconda3/envs/serve/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x5ffe55a (0x7c852a8cf55a in /opt/miniconda3/envs/serve/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) + 0x31e (0x7c852a8ca27e in /opt/miniconda3/envs/serve/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c8 (0x7c84e9449868 in /opt/miniconda3/envs/serve/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdf0e6 (0x7c86c74b40e6 in /opt/miniconda3/envs/serve/bin/../lib/libstdc++.so.6)
frame #7: + 0x9caa4 (0x7c86ca077aa4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #8: __clone + 0x44 (0x7c86ca104a64 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[rank1]:[W521 08:23:52.001241019 ProcessGroupNCCL.cpp:1771] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Connection reset by peer
[2026-05-21 08:23:52 TP1] Scheduler hit an exception: Traceback (most recent call last):
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/managers/scheduler.py", line 3118, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/managers/scheduler.py", line 363, in __init__
    self.init_model_worker()
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/managers/scheduler.py", line 559, in init_model_worker
    self.init_tp_model_worker()
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/managers/scheduler.py", line 517, in init_tp_model_worker
    self.tp_worker = TpModelWorker(
                     ^^^^^^^^^^^^^^
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/managers/tp_worker.py", line 247, in __init__
    self._init_model_runner()
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/managers/tp_worker.py", line 330, in _init_model_runner
    self._model_runner = ModelRunner(
                         ^^^^^^^^^^^^
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/model_executor/model_runner.py", line 415, in __init__
    self.initialize(min_per_gpu_memory)
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/model_executor/model_runner.py", line 495, in initialize
    self.load_model()
  File "/workspace/ktransformers/third_party/sglang/python/sglang/srt/model_executor/model_runner.py", line 1073, in load_model
    raise ValueError(
ValueError: TP rank 1 could finish the model loading, but there are other ranks that didn't finish loading. It is likely due to unexpected failures (e.g., OOM) or a slow node.
[2026-05-21 08:23:52] Received sigquit from a child process. It usually means the child failed.
glm_51.sh: line 32: 4371 Killed python -m sglang.launch_server --host 0.0.0.0 --port 30000 --model /code/GLM-5.1-FP8 --kt-weight-path /code/GLM-5.1-FP8 --kt-cpuinfer 96 --kt-threadpool-count 2 --kt-num-gpu-experts 72 --kt-method FP8 --kt-gpu-prefill-token-threshold 1024 --kt-enable-dynamic-expert-update --kt-expert-placement-strategy uniform --trust-remote-code --mem-fraction-static 0.85 --served-model-name GLM5.1 --enable-mixed-chunk --tensor-parallel-size 2 --enable-p2p-check --disable-shared-experts-fusion --chunked-prefill-size 16384 --max-running-requests 4 --max-total-tokens 128000 --attention-backend flashinfer --kv-cache-dtype bf16 --fp8-gemm-backend cutlass --tool-call-parser glm47 --reasoning-parser glm45 --watchdog-timeout 30000
I am not sure what to try next; any help would be appreciated.
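For anyone trying to reproduce this, here is a minimal diagnostic sketch for checking, from inside the container, how many NUMA nodes are visible and how much host memory is available. It assumes a Linux container with /sys and /proc mounted. The function names (parse_meminfo, visible_numa_nodes, report) are hypothetical helpers written for this report and are not part of ktransformers or sglang.

```python
import os
import re


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of values in kB.

    Lines without a kB unit (for example HugePages_Total) are skipped.
    """
    out = {}
    for line in text.splitlines():
        m = re.match(r"(\w+):\s+(\d+)\s*kB", line)
        if m:
            out[m.group(1)] = int(m.group(2))
    return out


def visible_numa_nodes(entries):
    """Return sorted NUMA node ids from directory names such as 'node0'."""
    ids = []
    for entry in entries:
        m = re.fullmatch(r"node(\d+)", entry)
        if m:
            ids.append(int(m.group(1)))
    return sorted(ids)


def report(node_dir="/sys/devices/system/node", meminfo="/proc/meminfo"):
    """Print the visible NUMA nodes and total/available memory, if readable."""
    nodes = visible_numa_nodes(os.listdir(node_dir)) if os.path.isdir(node_dir) else []
    print("visible NUMA nodes:", nodes)
    if os.path.isfile(meminfo):
        with open(meminfo) as f:
            info = parse_meminfo(f.read())
        for key in ("MemTotal", "MemAvailable"):
            if key in info:
                # /proc/meminfo reports kB; 1024**2 kB is one GiB.
                print(f"{key}: {info[key] / 1024**2:.1f} GiB")


if __name__ == "__main__":
    report()
```

If only node 0 is listed, the container sees a single NUMA node while the command requests `--kt-threadpool-count 2`, which may be related to the "NUMA node 1 not found" messages above. The MemAvailable figure may help judge whether the "Killed" line at the end of the log came from the host running out of memory, although this sketch alone cannot confirm that.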
Reproduction
Others
No response