
Issues with CUDA accelerator component initialization  #11831

Closed
@devreal

Description


I'm working with the CUDA accelerator component and tried to rebase my somewhat outdated branch onto current main. I believe I found an issue with the way the CUDA component is initialized: since ae98e04 we call cuInit in accelerator_cuda_init but do not set a context. From then on, every call to opal_accelerator_cuda_delayed_init (until the application makes its first call to a CUDA function) receives a NULL context from cuCtxGetCurrent and returns an error (https://github.com/open-mpi/ompi/blob/main/opal/mca/accelerator/cuda/accelerator_cuda_component.c#L146).

That prevents all other accelerator-related state in OMPI from initializing properly. On my system, at least smcuda (mca_btl_smcuda_accelerator_init) and ob1 (mca_pml_ob1_accelerator_init) do not enable accelerator support because they cannot create a stream, unless the application calls into CUDA before calling MPI_Init (in which case a CUDA context already exists). Is this what we want?

Interestingly, before ae98e04 we did not return an error from opal_accelerator_cuda_delayed_init in this situation (because cuCtxGetCurrent itself returned an error code), so accelerator support worked properly.

I believe the same behavior exists in the 5.x release branch.
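
To make the sequence concrete, here is a minimal sketch in plain C that models it without needing a GPU. All names (model_driver_t, model_ctx_get_current, model_delayed_init, model_app_touches_cuda) are hypothetical stand-ins, not Open MPI or CUDA symbols. The only behavior assumed from the real driver is that, once cuInit has run, cuCtxGetCurrent reports success and hands back a NULL context when none is bound to the calling thread.

```c
#include <stddef.h>

/* Hypothetical model of per-thread driver state; NOT the real CUDA API. */
typedef struct {
    void *current_ctx; /* NULL until something binds a context */
} model_driver_t;

/* Stand-in for cuCtxGetCurrent after cuInit: always succeeds, may yield NULL. */
static int model_ctx_get_current(const model_driver_t *drv, void **ctx)
{
    *ctx = drv->current_ctx;
    return 0; /* success */
}

/* Stand-in for the delayed-init check described above:
 * a successful query that yields a NULL context is treated as an error. */
static int model_delayed_init(const model_driver_t *drv)
{
    void *ctx = NULL;
    if (model_ctx_get_current(drv, &ctx) != 0) {
        return -1; /* the query itself failed */
    }
    if (ctx == NULL) {
        return -1; /* no context yet: callers give up on accelerator support */
    }
    return 0;
}

/* Stand-in for the application's first CUDA call, which binds a context. */
static void model_app_touches_cuda(model_driver_t *drv)
{
    static int dummy_ctx; /* any non-NULL address serves as the context */
    drv->current_ctx = &dummy_ctx;
}
```

In this model, calling model_delayed_init on a freshly zeroed model_driver_t returns -1, and it returns 0 only after model_app_touches_cuda has run. That mirrors what I see: accelerator support is enabled only if the application calls into CUDA before MPI_Init.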
