
[SYCL] [CUDA] Invalid memory access in multithreaded application when multiple contexts used #4171

Closed
@s-kanaev

Description


Describe the bug
Invalid memory access when using multiple contexts.

To Reproduce

Build commands:

clang++ -isystem $DPCPP_HOME/build/include/sycl -isystem $CUDADIR/include -fsycl -fsycl-targets=nvptx64-nvidia-cuda-sycldevice -fsycl-unnamed-lambda reduced_2q_repro.cpp -o reduced_2q_repro.bin -Wl,-rpath,$CUDADIR/lib64: $DPCPP_HOME/build/lib/libsycl.so $CUDADIR/lib64/libcurand.so /usr/lib/x86_64-linux-gnu/libcuda.so $CUDADIR/lib64/libcudart_static.a -lpthread -ldl /usr/lib/x86_64-linux-gnu/librt.so

clang++ -DUSE_SAME_CTX -isystem $DPCPP_HOME/build/include/sycl -isystem $CUDADIR/include -fsycl -fsycl-targets=nvptx64-nvidia-cuda-sycldevice -fsycl-unnamed-lambda reduced_2q_repro.cpp -o reduced_2q_repro_same_ctx.bin -Wl,-rpath,$CUDADIR/lib64: $DPCPP_HOME/build/lib/libsycl.so $CUDADIR/lib64/libcurand.so /usr/lib/x86_64-linux-gnu/libcuda.so $CUDADIR/lib64/libcudart_static.a -lpthread -ldl /usr/lib/x86_64-linux-gnu/librt.so

The first command builds an application that exhibits the invalid memory access; the second builds one that does not. The only difference between the two is that the former uses a distinct context for each of its two queues, while the latter uses the same context for both.

Source code:

#include <CL/sycl.hpp>
#include <CL/sycl/backend/cuda.hpp>
#include <cuda.h>

#include <cstdint>
#include <iostream>
#include <vector>

#ifdef USE_SAME_CTX
# define CONSTRUCT_CTX(name, dev) sycl::context name{dev}
# define CONSTRUCT_Q(name, ctx, dev, ...) sycl::queue name{ctx, dev, __VA_ARGS__}
#else
# define CONSTRUCT_CTX(name, dev)
# define CONSTRUCT_Q(name, ctx, dev, ...) sycl::queue name{dev, __VA_ARGS__}
#endif

void generate(sycl::queue &Q, sycl::buffer<float> &R) {
  Q.submit([&](sycl::handler &H) {
    auto Acc = R.get_access<sycl::access::mode::read_write>(H);

    H.host_task([=](sycl::interop_handle IH) {
      auto Ptr = reinterpret_cast<float *>(IH.get_native_mem<sycl::backend::cuda>(Acc));
      (void)Ptr;
    });
  }).wait_and_throw();
}

void test(sycl::device &D) {
  std::int64_t n = 1000;
  std::vector<float> RHost(n);
  std::vector<float> R1Host(n);

  CONSTRUCT_CTX(C, D);

  try {
    CONSTRUCT_Q(Q, C, D, [](sycl::exception_list EL) {
          for (auto EP : EL)
            std::rethrow_exception(EP);
        });
    sycl::buffer<float> R{RHost.data(), RHost.size()};

    generate(Q, R);

    Q.wait_and_throw();
  } catch (const std::exception &E) {
    std::cout << "Exception: " << E.what() << std::endl;
  }

  try {
    CONSTRUCT_Q(Q, C, D, [](sycl::exception_list EL) {
          for (auto EP : EL)
            std::rethrow_exception(EP);
        });
    sycl::buffer<float> R1{R1Host.data(), R1Host.size()};

    generate(Q, R1);

    Q.wait_and_throw();
  } catch (const std::exception &E) {
    std::cout << "Exception: " << E.what() << std::endl;
  }
}

int main() {
  sycl::device D{sycl::gpu_selector{}};

  std::cout << "Running on " << D.get_info<sycl::info::device::name>() << std::endl;

  for (size_t Idx = 0; Idx < 1; ++Idx) {
    std::cout << "Iteration " << Idx << " starts" << std::endl;
    test(D);
    std::cout << "Iteration " << Idx << " ends" << std::endl;
  }
}

reduced_2q_repro.bin does not fail outright, but Valgrind reports invalid memory accesses.

The payload of the application (i.e. the actions before, after, and inside the host task) can be modified so that the application eventually hangs due to the same invalid memory access.

If the for loop in main runs at least two iterations, the application is killed by a SEGFAULT caused by the same memory access violation.

The culprit of this invalid memory access is that cuCtxSetCurrent frees memory that is still needed when the host task attempts to wait for its own dependencies.

Valgrind's log (with SYCL_PI_TRACE=-1) is attached for reduced_2q_repro.bin with only a single iteration of for loop in main.

Environment (please complete the following information):

  • OS: Linux
  • Target device: CUDA
  • DPC++ version:
clang version 13.0.0 (git@github.com:intel/llvm.git 63ba1ceab4945c325485cb99b205558de6025f9a)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: ......
  • CUDA version: 10.1

Additional context

reduced_2q_repro.bin.memcheck.log

Metadata

Assignees: No one assigned

Labels:

  • blocker: Blocks important functionality
  • bug: Something isn't working
  • cuda: CUDA back-end
  • runtime: Runtime library related issue