
Fix nccl regression on PyTorch 2.3 upgrade #2099

Merged 5 commits into main on Jul 8, 2024

Conversation

fxmarty (Contributor) commented on Jun 20, 2024:

As per the title, this fixes NVIDIA/nccl#1251 in TGI's CUDA image, a regression introduced in #1730 & #1833.

We hit this issue e.g. with the Llama 3 70B model with TP=4 or TP=8 on H100 and default CUDA graphs. The hang can be reproduced for example with text-generation-benchmark --tokenizer-name meta-llama/Meta-Llama-3-70B-Instruct --sequence-length 128 --decode-length 10 --warmups 2 --runs 100 -b 1, where the shards hang in:

Thread 1302975 (active): "MainThread"
    sched_yield (libc.so.6)
    ncclLaunchKernelBefore_NoUncapturedCuda (enqueue.cc:968)
    doLaunches (group.cc:161)
    groupLaunch (group.cc:339)
    ncclGroupEndInternal (group.cc:418)
    ncclGroupEndInternal (group.cc:368)
    ncclEnqueueCheck (enqueue.cc:1981)
    ncclAllReduce (collectives.cc:49)
    c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#2}> (libtorch_cuda.so)
    c10d::ProcessGroupNCCL::allreduce_impl (libtorch_cuda.so)
    c10d::ProcessGroupNCCL::allreduce (libtorch_cuda.so)
    c10d::ops::(anonymous namespace)::allreduce_CUDA (libtorch_cpu.so)

PyTorch 2.3 has a hard requirement on NCCL 2.20.5, so I am not completely sure this fix is fine. We could also choose to downgrade instead.

An interesting read as well: https://pytorch.slack.com/archives/C3PDTEV8E/p1713223950622429?thread_ts=1712807088.459829&cid=C3PDTEV8E

I will wait for the build to run to re-check TGI's benchmark and look for any potential regression.
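
For context, the failing pattern is an NCCL all-reduce launched in combination with CUDA graphs across tensor-parallel shards. Below is a minimal sketch of that pattern (illustrative only, assuming a single-node torchrun launch with one process per GPU; not a verified standalone reproduction of the hang):

```python
# Illustrative sketch only: exercises the same general code path as TGI's
# decode step with CUDA graphs enabled, i.e. an NCCL all-reduce captured in a
# CUDA graph and then replayed. Launch e.g. with
#   torchrun --nproc-per-node=8 repro_sketch.py   (hypothetical file name)
import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    # Assumes a single node, so rank == local rank.
    torch.cuda.set_device(dist.get_rank())

    x = torch.ones(1024, device="cuda")

    # Warm up the communicator outside of graph capture.
    dist.all_reduce(x)
    torch.cuda.synchronize()

    # Capture the collective in a CUDA graph, then replay it.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        dist.all_reduce(x)

    for _ in range(10):
        graph.replay()
    torch.cuda.synchronize()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```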

fxmarty requested review from OlivierDehaene and Hugoch on June 20, 2024 18:05
@@ -232,7 +234,8 @@ COPY server/Makefile server/Makefile
 RUN cd server && \
     make gen-server && \
     pip install -r requirements_cuda.txt && \
-    pip install ".[bnb, accelerate, quantize, peft, outlines]" --no-cache-dir
+    pip install ".[bnb, accelerate, quantize, peft, outlines]" --no-cache-dir && \
+    pip install nvidia-nccl-cu12==2.22.3
fxmarty (Contributor Author) commented on the diff:

I would have liked to use pyproject.toml for this, but poetry does not handle this kind of dependency conflict; see python-poetry/poetry#697 (comment).
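
To double-check the pin above inside the built image, a small sanity check could look like the following (an illustrative sketch, not part of the PR; the path matches the LD_PRELOAD value added elsewhere in this PR):

```python
# Illustrative check: report which NCCL version PyTorch resolves at runtime and
# whether the libnccl.so.2 from the pinned nvidia-nccl-cu12 wheel is present.
import pathlib
import torch

# NCCL version PyTorch resolves at runtime, as a tuple such as (2, 22, 3);
# if the preload takes effect, this should be the wheel's copy.
print("torch reports NCCL", torch.cuda.nccl.version())

# Path of the wheel-provided libnccl.so.2, taken from the LD_PRELOAD value
# used in this PR.
wheel_lib = pathlib.Path(
    "/opt/conda/lib/python3.10/site-packages/nvidia/nccl/lib/libnccl.so.2"
)
print("wheel libnccl present:", wheel_lib.exists())
```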

Narsil (Collaborator) commented on Jun 24, 2024:

Thanks a lot for the find, the fix and the details.

I'm more inclined to wait for torch to fix it (2.3.1 hasn't fixed it yet), since as far as I know this does NOT affect production.
If it did, I'd be 100% behind your solution (it seems better than downgrading for the time being, since torch 2.3 still received some nice upgrades).

fxmarty (Contributor Author) commented on Jun 25, 2024:

As you'd like. I am using this fix to benchmark.

Hugoch (Member) commented on Jul 1, 2024:

Nice fix @fxmarty!
I confirm that upgrading NCCL as proposed fixes the systematic hang on 8xH100 P5 instances, where TGI freezes without crashing. PyTorch 2.4 should be released this month; let's check whether it updates NCCL, otherwise it would be nice to merge this patch.

OlivierDehaene (Member) left a comment:

Since this affects real deployments, let's merge it.

pip install ".[bnb, accelerate, quantize, peft, outlines]" --no-cache-dir && \
pip install nvidia-nccl-cu12==2.22.3

ENV LD_PRELOAD=/opt/conda/lib/python3.10/site-packages/nvidia/nccl/lib/libnccl.so.2

A reviewer (Member) commented:

Why do we need to preload?

fxmarty (Contributor Author) replied on Jul 8, 2024:

Otherwise the shared object is not used. TGI's current base Docker image is nvidia/cuda:12.1.0-base-ubuntu22.04, which contains no libnccl.so anywhere, and none is loaded by PyTorch either, although we do have /opt/conda/lib/libcudart.so.12.1.105 etc. COPY --from=pytorch-install /opt/conda /opt/conda does not seem to copy any libnccl.so. Weird.
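
For what it's worth, a quick way to confirm the preload takes effect is to inspect the process memory map (an illustrative sketch, assuming the Linux container above; not part of the PR):

```python
# Illustrative check: list which libnccl shared objects are mapped into the
# current process. With LD_PRELOAD set as in the Dockerfile, the wheel's
# libnccl.so.2 should appear here even before importing torch.
import os

print("LD_PRELOAD =", os.environ.get("LD_PRELOAD"))

with open("/proc/self/maps") as maps:
    mapped = sorted({line.split()[-1] for line in maps if "libnccl" in line})

print("mapped libnccl objects:", mapped or "none")
```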

Hugoch mentioned this pull request on Jul 8, 2024
OlivierDehaene merged commit 4c50b6d into main on Jul 8, 2024
8 of 9 checks passed
OlivierDehaene deleted the fix-nccl-regression branch on July 8, 2024 15:52
HoKim98 mentioned this pull request on Jul 11, 2024
ErikKaum pushed a commit that referenced this pull request Jul 26, 2024
* fix nccl issue

* add note in dockerfile

* use v2.22.3 that also fixes @samsamoa's repro

* poetry actually can't handle the conflict between torch and nccl

* set LD_PRELOAD
yuanwu2017 pushed a commit to yuanwu2017/tgi-gaudi that referenced this pull request Sep 26, 2024
* fix nccl issue

* add note in dockerfile

* use v2.22.3 that also fixes @samsamoa's repro

* poetry actually can't handle the conflict between torch and nccl

* set LD_PRELOAD
Successfully merging this pull request may close these issues.

Leak in FIFO queue