[Bug]: meta-llama/Llama-4-Scout-17B-16E-Instruct compatibility #16330

Reported by @alokkrsahu

Description

Your current environment

The error occurs while deploying the LLM.

🐛 Describe the bug

Loading safetensors checkpoint shards: 96% Completed | 48/50 [00:03<00:00, 25.21it/s]
Loading safetensors checkpoint shards: 100% Completed | 50/50 [00:03<00:00, 13.83it/s]
(VllmWorker rank=0 pid=2770071)
(VllmWorker rank=0 pid=2770071) INFO 04-08 16:26:46 [loader.py:447] Loading weights took 177.20 seconds
(VllmWorker rank=0 pid=2770071) INFO 04-08 16:26:47 [gpu_model_runner.py:1273] Model loading took 53.1198 GiB and 178.255213 seconds
(VllmWorker rank=1 pid=2770083) INFO 04-08 16:26:49 [loader.py:447] Loading weights took 180.12 seconds
(VllmWorker rank=2 pid=2770173) INFO 04-08 16:26:49 [loader.py:447] Loading weights took 180.84 seconds
(VllmWorker rank=1 pid=2770083) INFO 04-08 16:26:49 [gpu_model_runner.py:1273] Model loading took 53.1198 GiB and 181.444976 seconds
(VllmWorker rank=2 pid=2770173) INFO 04-08 16:26:49 [gpu_model_runner.py:1273] Model loading took 53.1198 GiB and 182.203734 seconds
(VllmWorker rank=3 pid=2770198) INFO 04-08 16:26:49 [loader.py:447] Loading weights took 181.73 seconds
(VllmWorker rank=3 pid=2770198) INFO 04-08 16:26:50 [gpu_model_runner.py:1273] Model loading took

what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8f1ef6c1b6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f8f1ef15a76 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f8f1f355918 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x103ad78 (0x7f8ecd065d78 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x10433c5 (0x7f8ecd06e3c5 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0x6417b2 (0x7f8f168d07b2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #6: + 0x6f30f (0x7f8f1ef4d30f in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f8f1ef4633b in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f8f1ef464e9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #9: + 0x8fefb8 (0x7f8f16b8dfb8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #10: THPVariable_subclass_dealloc(_object*) + 0x2f6 (0x7f8f16b8e306 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #11: + 0x13c035 (0x7f8e5f4a5035 in /usr/local/lib/python3.10/dist-packages/numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)

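As the trace itself suggests, the usual first step with a device-side assert is to re-run with `CUDA_LAUNCH_BLOCKING=1` so the error is reported at the kernel launch that actually failed, and with `TORCH_USE_CUDA_DSA=1` if the PyTorch build supports device-side assertions. A minimal sketch of such a relaunch follows; the exact serve command is not shown in the report, so the model name is taken from the issue title and `--tensor-parallel-size 4` is inferred from the four worker ranks in the log.

```shell
# Hypothetical debugging relaunch (command-line details assumed, see lead-in):
export CUDA_LAUNCH_BLOCKING=1   # report CUDA errors at the failing launch site
export TORCH_USE_CUDA_DSA=1     # only effective on builds compiled with DSA support
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct --tensor-parallel-size 4
```

With blocking launches enabled the Python-level stack trace should point at the operation that triggered the assert, rather than at an unrelated later call such as the `TensorImpl` destructor seen above.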
Before submitting a new issue...

  • Make sure you have already searched for relevant issues and asked the chatbot at the bottom-right corner of the documentation page, which can answer many frequently asked questions.

Metadata

Labels: bug (Something isn't working)
Status: Done
Milestone: none
Development: no branches or pull requests