Crash when using CB mode with multi-rank #440

@a3213105

`RUN_WORKLOAD="python /root/test.py -m /mnt/nvme1/llm_model/chatglm3-6b-32k-cpu/ -t /mnt/nvme1/llm_model/chatglm3-6b-32k -d bf16 --kv_cache_dtype int8 -c 1"

OMP_NUM_THREADS=10 LD_PRELOAD=libiomp5.so mpirun \
  -n 1 numactl -N 0 -p 8 ${RUN_WORKLOAD} : \
  -n 1 numactl -N 1 -p 9 ${RUN_WORKLOAD} : \
  -n 1 numactl -N 2 -p 10 ${RUN_WORKLOAD} : \
  -n 1 numactl -N 3 -p 11 ${RUN_WORKLOAD} : \
  -n 1 numactl -N 4 -p 12 ${RUN_WORKLOAD} : \
  -n 1 numactl -N 5 -p 13 ${RUN_WORKLOAD} : \
  -n 1 numactl -N 6 -p 14 ${RUN_WORKLOAD} : \
  -n 1 numactl -N 7 -p 15 ${RUN_WORKLOAD}`
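To narrow down whether the crash depends on the number of ranks, it may help to generate the same launch line for 1, 2, 4 or 8 ranks. The following is a minimal sketch; the helper name `build_mpirun_command` and its parameters are hypothetical, and it only assumes the pattern visible in the command above (rank `i` bound to NUMA node `i` with preferred memory node `8 + i`).

```python
import shlex


def build_mpirun_command(workload, num_ranks=8, first_mem_node=8):
    """Build an mpirun argument list with one single-process segment per rank.

    Assumption taken from the command in this report: rank i runs on
    NUMA node i (-N i) and prefers memory node first_mem_node + i (-p).
    """
    if num_ranks < 1:
        raise ValueError("num_ranks must be at least 1")
    workload_args = shlex.split(workload)
    argv = ["mpirun"]
    for rank in range(num_ranks):
        if rank > 0:
            argv.append(":")  # separator between per-rank segments
        argv += ["-n", "1", "numactl", "-N", str(rank),
                 "-p", str(first_mem_node + rank)]
        argv += workload_args
    return argv


# Same workload as in the report; OMP_NUM_THREADS and LD_PRELOAD still
# have to be set in the environment before launching.
RUN_WORKLOAD = (
    "python /root/test.py -m /mnt/nvme1/llm_model/chatglm3-6b-32k-cpu/ "
    "-t /mnt/nvme1/llm_model/chatglm3-6b-32k -d bf16 --kv_cache_dtype int8 -c 1"
)

for ranks in (1, 2, 4, 8):
    print(shlex.join(build_mpirun_command(RUN_WORKLOAD, num_ranks=ranks)))
```

Each printed line can be pasted after `OMP_NUM_THREADS=10 LD_PRELOAD=libiomp5.so`. Whether the crash reproduces with fewer ranks has not been checked here.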

crash results:

ENABLE_TUNED_COMM is enabled for faster reduceAdd.
ENABLE_TUNED_COMM is enabled for faster reduceAdd.
ENABLE_TUNED_COMM is enabled for faster reduceAdd.
ENABLE_TUNED_COMM is enabled for faster reduceAdd.
ENABLE_TUNED_COMM is enabled for faster reduceAdd.
ENABLE_TUNED_COMM is enabled for faster reduceAdd.
ENABLE_TUNED_COMM is enabled for faster reduceAdd.
ENABLE_TUNED_COMM is enabled for faster reduceAdd.
[hbm01:50418:0:50418] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x563774802740)
[hbm01:50419:0:50419] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x59c87fe71840)
[hbm01:50420:0:50420] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x6226e4433d80)
[hbm01:50424:0:50424] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x58213b842bc0)
[hbm01:50425:0:50425] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x6385bc3d80c0)
[hbm01:50421:0:50421] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x63b865c08d80)
[hbm01:50422:0:50422] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x603dd065ff80)
[hbm01:50423:0:50423] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x63c50dcbc900)
malloc(): corrupted top size
malloc(): corrupted top size
malloc(): corrupted top size
malloc(): corrupted top size
malloc(): corrupted top size
==== backtrace (tid: 50424) ====
0 /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2e4) [0x72d6c3a73fc4]
1 /lib/x86_64-linux-gnu/libucs.so.0(+0x24fec) [0x72d6c3a77fec]
2 /lib/x86_64-linux-gnu/libucs.so.0(+0x251aa) [0x72d6c3a781aa]
3 /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x72d6fb842520]
4 /lib/x86_64-linux-gnu/libc.so.6(+0x1a6b55) [0x72d6fb9a6b55]
5 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZN3xft5Model7forwardEb+0x71b) [0x72d6bb7c133d]
6 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZN14TorchAutoModel9forwardCBEv+0x65) [0x72d6bb78142d]
7 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZSt13__invoke_implIN2at6TensorERKM14TorchAutoModelFS1_vERS2_JEET_St19__invoke_memfun_refOT0_OT1_DpOT2_+0x84) [0x72d6bb7b42b7]
8 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZSt8__invokeIRKM14TorchAutoModelFN2at6TensorEvEJRS0_EENSt15__invoke_resultIT_JDpT0_EE4typeEOS9_DpOSA_+0x58) [0x72d6bb7b31f4]
9 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZNKSt12_Mem_fn_baseIM14TorchAutoModelFN2at6TensorEvELb1EEclIJRS0_EEEDTcl8__invokedtdefpT6_M_pmfspcl7forwardIT_Efp_EEEDpOS8_+0x49) [0x72d6bb7b2313]
10 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZN3c104guts6invokeIRM14TorchAutoModelFN2at6TensorEvEJRS2_EEENSt9enable_ifIX19is_member_pointer_vINSt5decayIT_E4typeEEENSt13invoke_resultISB_JDpT0_EE4typeEE4typeEOSB_DpOSF_+0x6f) [0x72d6bb7b07b5]
11 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZN5torch6detail10WrapMethodIM14TorchAutoModelFN2at6TensorEvEEclEN3c1013intrusive_ptrIS2_NS8_6detail34intrusive_target_default_null_typeIS2_EEEE+0x49) [0x72d6bb7ae54b]
12 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZN5torch6detail32call_torchbind_method_from_stackINS0_10WrapMethodIM14TorchAutoModelFN2at6TensorEvEEELb0EJLm0EEEEN3c104guts23infer_function_traits_t11return_typeERT_RSt6vectorINS9_6IValueESaISG_EESt16integer_sequenceImJXspT1_EEE+0x6f) [0x72d6bb7aa9f4]
13 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZN5torch6detail32call_torchbind_method_from_stackINS0_10WrapMethodIM14TorchAutoModelFN2at6TensorEvEEELb0EEEN3c104guts23infer_function_traits_t11return_typeERT_RSt6vectorINS9_6IValueESaISG_EE+0x46) [0x72d6bb7a3811]
14 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZN5torch6detail10BoxedProxyIN2at6TensorENS0_10WrapMethodIM14TorchAutoModelFS3_vEEEEclERSt6vectorIN3c106IValueESaISC_EERS8_+0x3f) [0x72d6bb79e5bf]
15 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZZN5torch6class_I14TorchAutoModelE12defineMethodINS_6detail10WrapMethodIMS1_FN2at6TensorEvEEEEEPNS_3jit8FunctionESsT_SsSt16initializer_listINS_3argEEENUlRSt6vectorIN3c106IValueESaISK_EEE_clESN_+0x3a) [0x72d6bb796686]
16 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZSt13__invoke_implIvRZN5torch6class_I14TorchAutoModelE12defineMethodINS0_6detail10WrapMethodIMS2_FN2at6TensorEvEEEEEPNS0_3jit8FunctionESsT_SsSt16initializer_listINS0_3argEEEUlRSt6vectorIN3c106IValueESaISL_EEE_JSO_EESF_St14__invoke_otherOT0_DpOT1_+0x3b) [0x72d6bb7b084d]
17 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZSt10__invoke_rIvRZN5torch6class_I14TorchAutoModelE12defineMethodINS0_6detail10WrapMethodIMS2_FN2at6TensorEvEEEEEPNS0_3jit8FunctionESsT_SsSt16initializer_listINS0_3argEEEUlRSt6vectorIN3c106IValueESaISL_EEE_JSO_EENSt9enable_ifIX16is_invocable_r_vISF_T0_DpT1_EESF_E4typeEOSS_DpOST_+0x3b) [0x72d6bb7ae623]
18 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZNSt17_Function_handlerIFvRSt6vectorIN3c106IValueESaIS2_EEEZN5torch6class_I14TorchAutoModelE12defineMethodINS7_6detail10WrapMethodIMS9_FN2at6TensorEvEEEEEPNS7_3jit8FunctionESsT_SsSt16initializer_listINS7_3argEEEUlS5_E_E9_M_invokeERKSt9_Any_dataS5_+0x3b) [0x72d6bb7aaac1]
19 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZNKSt8functionIFvRSt6vectorIN3c106IValueESaIS2_EEEEclES5_+0x4d) [0x72d6bb7884c1]
20 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZN5torch3jit17BuiltinOpFunction3runERSt6vectorIN3c106IValueESaIS4_EE+0x2b) [0x72d6bb77d89d]
21 /root/miniforge3/envs/xft/lib/python3.9/site-packages/torch/lib/libtorch_python.so(+0xa10c6e) [0x72d6fa010c6e]
22 /root/miniforge3/envs/xft/lib/python3.9/site-packages/torch/lib/libtorch_python.so(+0xaf4581) [0x72d6fa0f4581]
23 /root/miniforge3/envs/xft/lib/python3.9/site-packages/torch/lib/libtorch_python.so(+0xab54fa) [0x72d6fa0b54fa]
24 /root/miniforge3/envs/xft/lib/python3.9/site-packages/torch/lib/libtorch_python.so(+0xab5728) [0x72d6fa0b5728]
25 /root/miniforge3/envs/xft/lib/python3.9/site-packages/torch/lib/libtorch_python.so(+0x4847bf) [0x72d6f9a847bf]
26 python(+0x15ef25) [0x58212cd0ef25]
27 python(_PyObject_MakeTpCall+0x316) [0x58212ccf5ba6]
28 python(+0x1a0791) [0x58212cd50791]
29 python(_PyObject_Call+0x10b) [0x58212cd0f32b]
30 python(+0xb9a38) [0x58212cc69a38]
31 python(_PyObject_MakeTpCall+0x316) [0x58212ccf5ba6]
32 python(_PyEval_EvalFrameDefault+0x535b) [0x58212cd93dbb]
33 python(_PyFunction_Vectorcall+0x19a) [0x58212cd4f88a]
34 python(_PyEval_EvalFrameDefault+0x609) [0x58212cd8f069]
35 python(_PyFunction_Vectorcall+0x19a) [0x58212cd4f88a]
36 python(_PyEval_EvalFrameDefault+0x3bc) [0x58212cd8ee1c]
37 python(+0x138550) [0x58212cce8550]
38 python(_PyEval_EvalCodeWithName+0x47) [0x58212cdcf047]
39 python(PyEval_EvalCodeEx+0x39) [0x58212cdcf089]
40 python(PyEval_EvalCode+0x1b) [0x58212cdcf0ab]
41 python(+0x251909) [0x58212ce01909]
42 python(+0x28c3a4) [0x58212ce3c3a4]
43 python(+0x118d33) [0x58212ccc8d33]
44 python(PyRun_SimpleFileExFlags+0x19c) [0x58212ce4683c]
45 python(Py_RunMain+0x395) [0x58212ce46f05]
46 python(Py_BytesMain+0x39) [0x58212ce47059]
47 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x72d6fb829d90]
48 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x72d6fb829e40]
49 python(+0x20bf1d) [0x58212cdbbf1d]
malloc(): corrupted top size
malloc(): corrupted top size
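The backtrace above is easier to triage once it is split into module, symbol and offset. Below is a rough sketch of such a parser; `FRAME_RE`, `parse_frames` and `first_frame_in` are hypothetical names, and the regular expression only assumes the frame layout seen in this log. The sample frames are copied from the backtrace; the first xFasterTransformer frame carries the mangled name `_ZN3xft5Model7forwardEb`, which is `xft::Model::forward(bool)`, called from `TorchAutoModel::forwardCB()`.

```python
import re

# Layout assumed from the log: "<index> <module>(<symbol>+<offset>) [<address>]"
FRAME_RE = re.compile(
    r"^\s*(?P<index>\d+)\s+(?P<module>\S+?)"
    r"\((?P<symbol>[^+()]*)\+(?P<offset>0x[0-9a-f]+)\)"
    r"\s+\[(?P<address>0x[0-9a-f]+)\]\s*$"
)

SAMPLE = """\
3 /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x72d6fb842520]
4 /lib/x86_64-linux-gnu/libc.so.6(+0x1a6b55) [0x72d6fb9a6b55]
5 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZN3xft5Model7forwardEb+0x71b) [0x72d6bb7c133d]
6 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZN14TorchAutoModel9forwardCBEv+0x65) [0x72d6bb78142d]
"""


def parse_frames(text):
    """Return one dict per backtrace line that matches FRAME_RE."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append({
                "index": int(match["index"]),
                "module": match["module"].rsplit("/", 1)[-1],  # file name only
                "symbol": match["symbol"] or None,  # None when no symbol is listed
                "offset": match["offset"],
            })
    return frames


def first_frame_in(frames, module_name):
    """Return the first (innermost) frame that belongs to module_name, or None."""
    for frame in frames:
        if frame["module"] == module_name:
            return frame
    return None


frames = parse_frames(SAMPLE)
print(first_frame_in(frames, "libxfastertransformer_pt.so"))
```

The frame directly above it (frame 4) is an unnamed libc routine, so the faulting access appears to happen inside a libc call made from `xft::Model::forward(bool)`; the log alone does not say which call.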

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 50418 RUNNING AT hbm01
= KILLED BY SIGNAL: 6 (Aborted)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 50419 RUNNING AT hbm01
= KILLED BY SIGNAL: 6 (Aborted)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 2 PID 50420 RUNNING AT hbm01
= KILLED BY SIGNAL: 6 (Aborted)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 3 PID 50421 RUNNING AT hbm01
= KILLED BY SIGNAL: 6 (Aborted)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 4 PID 50422 RUNNING AT hbm01
= KILLED BY SIGNAL: 6 (Aborted)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 5 PID 50423 RUNNING AT hbm01
= KILLED BY SIGNAL: 6 (Aborted)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 6 PID 50424 RUNNING AT hbm01
= KILLED BY SIGNAL: 11 (Segmentation fault)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 7 PID 50425 RUNNING AT hbm01
= KILLED BY SIGNAL: 6 (Aborted)
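To see at a glance which ranks ended with which signal, the termination banners can be grouped by signal name. This is a small sketch with hypothetical names (`TERMINATION_RE`, `ranks_by_signal`), assuming only the banner layout shown above; the sample text is copied from the last three banners.

```python
import re
from collections import defaultdict

# Banner layout assumed from the log:
#   = RANK <n> PID <pid> RUNNING AT <host>
#   = KILLED BY SIGNAL: <num> (<name>)
TERMINATION_RE = re.compile(
    r"RANK (?P<rank>\d+) PID (?P<pid>\d+) RUNNING AT (?P<host>\S+)\s*\n"
    r"=\s*KILLED BY SIGNAL: (?P<signal>\d+) \((?P<name>[^)]*)\)"
)

SAMPLE = """\
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 5 PID 50423 RUNNING AT hbm01
= KILLED BY SIGNAL: 6 (Aborted)

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 6 PID 50424 RUNNING AT hbm01
= KILLED BY SIGNAL: 11 (Segmentation fault)

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 7 PID 50425 RUNNING AT hbm01
= KILLED BY SIGNAL: 6 (Aborted)
"""


def ranks_by_signal(log_text):
    """Map each signal name to the sorted list of ranks killed by it."""
    grouped = defaultdict(list)
    for match in TERMINATION_RE.finditer(log_text):
        grouped[match["name"]].append(int(match["rank"]))
    return {name: sorted(ranks) for name, ranks in grouped.items()}


print(ranks_by_signal(SAMPLE))
```

In the full log, rank 6 (PID 50424, the thread shown in the backtrace) is the only one reported as killed by signal 11; the other seven ranks end with signal 6, which would be consistent with the `malloc(): corrupted top size` aborts, although every rank also logged a caught segmentation fault earlier.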
