[SYCL] [CUDA] Invalid memory access in multithreaded application when multiple contexts used

**Describe the bug**
Invalid memory access when using multiple contexts.

**To Reproduce**

Build commands:
```
clang++ -isystem $DPCPP_HOME/build/include/sycl -isystem $CUDADIR/include -fsycl -fsycl-targets=nvptx64-nvidia-cuda-sycldevice -fsycl-unnamed-lambda reduced_2q_repro.cpp -o reduced_2q_repro.bin -Wl,-rpath,$CUDADIR/lib64: $DPCPP_HOME/build/lib/libsycl.so $CUDADIR/lib64/libcurand.so /usr/lib/x86_64-linux-gnu/libcuda.so $CUDADIR/lib64/libcudart_static.a -lpthread -ldl /usr/lib/x86_64-linux-gnu/librt.so

clang++ -DUSE_SAME_CTX -isystem $DPCPP_HOME/build/include/sycl -isystem $CUDADIR/include -fsycl -fsycl-targets=nvptx64-nvidia-cuda-sycldevice -fsycl-unnamed-lambda reduced_2q_repro.cpp -o reduced_2q_repro_same_ctx.bin -Wl,-rpath,$CUDADIR/lib64: $DPCPP_HOME/build/lib/libsycl.so $CUDADIR/lib64/libcurand.so /usr/lib/x86_64-linux-gnu/libcuda.so $CUDADIR/lib64/libcudart_static.a -lpthread -ldl /usr/lib/x86_64-linux-gnu/librt.so
```

The first command builds an application which will produce invalid memory access. The second one will produce an application which won't produce invalid memory access. The only difference between these two applications is that the former will use different contexts for two queues. The latter application will use the same context for both queues.

Source code:
```c++
#include <CL/sycl.hpp>
#include <CL/sycl/backend/cuda.hpp>
#include <cuda.h>

#ifdef USE_SAME_CTX
# define CONSTRUCT_CTX(name, dev) sycl::context name{dev}
# define CONSTRUCT_Q(name, ctx, dev, ...) sycl::queue name{ctx, dev, __VA_ARGS__}
#else
# define CONSTRUCT_CTX(name, dev)
# define CONSTRUCT_Q(name, ctx, dev, ...) sycl::queue name{dev, __VA_ARGS__}
#endif

void generate(sycl::queue &Q, sycl::buffer<float> &R) {
  Q.submit([&](sycl::handler &H) {
    auto Acc = R.get_access<sycl::access::mode::read_write>(H);

    H.host_task([=](sycl::interop_handle IH) {
      auto Ptr = reinterpret_cast<float *>(IH.get_native_mem<sycl::backend::cuda>(Acc));
      (void)Ptr;
    });
  }).wait_and_throw();
}

void test(sycl::device &D) {
  std::int64_t n = 1000;
  std::vector<float> RHost(n);
  std::vector<float> R1Host(n);

  CONSTRUCT_CTX(C, D);

  try {
    CONSTRUCT_Q(Q, C, D, [](sycl::exception_list EL) {
          for (auto EP : EL)
            std::rethrow_exception(EP);
        });
    sycl::buffer<float> R{RHost.data(), RHost.size()};

    generate(Q, R);

    Q.wait_and_throw();
  } catch (const std::exception &E) {
    std::cout << "Exception: " << E.what() << std::endl;
  }

  try {
    CONSTRUCT_Q(Q, C, D, [](sycl::exception_list EL) {
          for (auto EP : EL)
            std::rethrow_exception(EP);
        });
    sycl::buffer<float> R1{R1Host.data(), R1Host.size()};

    generate(Q, R1);

    Q.wait_and_throw();
  } catch (const std::exception &E) {
    std::cout << "Exception: " << E.what() << std::endl;
  }
}

int main() {
  sycl::device D{sycl::gpu_selector{}};

  std::cout << "Running on " << D.get_info<sycl::info::device::name>() << std::endl;

  for (size_t Idx = 0; Idx < 1; ++Idx) {
    std::cout << "Iteration " << Idx << " starts" << std::endl;
    test(D);
    std::cout << "Iteration " << Idx << " ends" << std::endl;
  }
}
```

`reduced_2q_repro.bin` will not fail. Though valgrind will report invalid memory accesses.

Payload of the application (i.e. actions before, after and inside host-task) can be modified in such a way that the application will eventually hang due to the same invalid memory access.

If the `for` loop in `main` has at least two iterations, the application is killed due to SEGFAULT which is due to the same memory access violation.

Culprit of this invalid memory access is that `cuCtxSetCurrent` frees some memory when host-task attempts to wait for own dependencies.

Valgrind's log (with `SYCL_PI_TRACE=-1`) is attached for `reduced_2q_repro.bin` with only a single iteration of `for` loop in `main`.

**Environment (please complete the following information):**

- OS: Linux
- Target device: CUDA
- DPC++ version: 
```
clang version 13.0.0 (git@github.com:intel/llvm.git 63ba1ceab4945c325485cb99b205558de6025f9a)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: ......
```
- CUDA version: 10.1

**Additional context**

[reduced_2q_repro.bin.memcheck.log](https://github.com/intel/llvm/files/6870044/reduced_2q_repro.bin.memcheck.log)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SYCL] [CUDA] Invalid memory access in multithreaded application when multiple contexts used #4171

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[SYCL] [CUDA] Invalid memory access in multithreaded application when multiple contexts used #4171

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions