NCCL backend from UCC installed in NVHPCSDK

Hello,

Is the NCCL backend of UCC available in the hpcx-mpi installation shipped with nvhpcsdk?

According to ucc_info the TL is available. I load the libraries with:

```
module load /leonardo/prod/opt/compilers/nvhpc/25.3/binary/modulefiles/nvhpc-hpcx-cuda12/25.3
source /leonardo/prod/opt/compilers/nvhpc/25.3/binary/Linux_x86_64/25.3/comm_libs/12.8/hpcx/hpcx-2.22.1/hpcx-init.sh
hpcx_load
```
This installation uses UCC 1.4.3. I checked the availability of the TL with:

- `ucc_info -s`
```
Loading /leonardo/prod/opt/compilers/nvhpc/25.3/binary/modulefiles/nvhpc-hpcx-cuda12/25.3
  Loading requirement: hpcx
Default CLs scores: basic=10 hier=50
Default TLs scores: cuda=40 mlx5=1 nccl=20 self=50 sharp=30 shm=100 ucp=10
```
- `ucc_info -b | grep "nccl"`
```
#define UCC_CONFIGURE_FLAGS       "--with-ucx=/build-result/hpcx-v2.22.1-gcc-doca_ofed-redhat8-cuda12-x86_64/ucx --with-sharp=/build-result/hpcx-v2.22.1-gcc-doca_ofed-redhat8-cuda12-x86_64/sharp --with-rdmacm --with-tlcp=alltoall_block --with-cuda=/hpc/local/oss/cuda12.6.3/redhat8 --with-nccl --with-tls=cuda,nccl,self,sharp,shm,ucp,mlx5 --prefix=/build-result/hpcx-v2.22.1-gcc-doca_ofed-redhat8-cuda12-x86_64/ucc"
```
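For reference, the `ucc_info -s` check above can be scripted. This is only a minimal sketch: the function name `ucc_default_tl_score` is made up for illustration, and it assumes the `Default TLs scores:` line keeps the `name=score` layout shown above.

```shell
# Hypothetical helper (not part of UCC): read `ucc_info -s` output on stdin
# and print the default score of the TL named in $1.
# Exits non-zero if that TL does not appear in the "Default TLs scores:" line.
ucc_default_tl_score() {
    tl="$1"
    awk -v tl="$tl" '
        /^Default TLs scores:/ {
            # Fields 1-3 are "Default", "TLs", "scores:"; the rest are name=score pairs.
            for (i = 4; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == tl) { print kv[2]; found = 1 }
            }
        }
        END { exit (found ? 0 : 1) }
    '
}
```

Usage: `ucc_info -s | ucc_default_tl_score nccl`. This only confirms that the TL was built and has a default score; it says nothing about whether the TL is actually selected at runtime.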
At runtime I set:

```
export OMPI_MCA_coll_ucc_enable=1
export OMPI_MCA_coll_ucc_priority=100
export UCC_TL_NCCL_TUNE=allreduce:cuda:inf
```

However, the TL selected for allreduce does not change. With `--mca coll_ucc_verbose`, UCP is always reported as the TL for the CUDA memory type:

```
[1766411628.231717] [lrdn1487:319887:0] ucc_coll_score_map.c:203  UCC  INFO  Allreduce:
[1766411628.231717] [lrdn1487:319887:0] ucc_coll_score_map.c:203  UCC  INFO     Host: {0..4095}:TL_SHM:10 {4K..8K}:TL_SHM:10 {8193..inf}:TL_UCP:10
[1766411628.231717] [lrdn1487:319887:0] ucc_coll_score_map.c:203  UCC  INFO     Cuda: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10
[1766411628.231717] [lrdn1487:319887:0] ucc_coll_score_map.c:203  UCC  INFO     CudaManaged: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10
```
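To compare runs quickly, the score-map lines above can be reduced to one `range TL score` triple per line. This is a rough sketch: `ucc_selected_tls` is a made-up name, and it assumes the log keeps the `{range}:TL_NAME:score` layout shown in the excerpt.

```shell
# Hypothetical helper (not part of UCC): read verbose UCC log lines on stdin
# and, for the memory type named in $1 (e.g. Host, Cuda, CudaManaged),
# print one "range TL score" triple per line.
ucc_selected_tls() {
    mem_type="$1"
    # 1) keep only the row for this memory type and strip everything up to "<mem_type>:"
    # 2) put each {range}:TL:score entry on its own line
    # 3) rewrite each entry as "range TL score"
    sed -n "s/.*UCC  *INFO  *${mem_type}: *//p" |
        tr ' ' '\n' |
        sed -n 's/^{\(.*\)}:\(TL_[A-Z0-9_]*\):\([0-9a-z]*\)$/\1 \2 \3/p'
}
```

Saving the verbose output to a file (here a placeholder name, `verbose.log`) and running `ucc_selected_tls Cuda < verbose.log` lists only the CUDA rows. If the NCCL tuning took effect, I would expect a `TL_NCCL` entry to show up there instead of `TL_UCP`.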

I also see some failures during initialization, related to the CUDA TL:

```
[1766412746.460863] [lrdn0259:811666:0]         mc_cuda.c:78   cuda mc DEBUG cuCtxGetDevice() failed: invalid device context
...
[1766412746.461583] [lrdn0259:811667:0] tl_cuda_context.c:43   TL_CUDA DEBUG cannot create CUDA TL context without active CUDA context
[1766412746.461589] [lrdn0259:811667:0]     ucc_context.c:412  UCC  DEBUG failed to create tl context for cuda

```
Could you please give me more information about this error? Should I expect it to be related to the NCCL TL not being used?

Thank you for your time,

Laura
