Description
Describe the bug
Invalid memory access when using multiple contexts.
To Reproduce
Build commands:
clang++ -isystem $DPCPP_HOME/build/include/sycl -isystem $CUDADIR/include -fsycl -fsycl-targets=nvptx64-nvidia-cuda-sycldevice -fsycl-unnamed-lambda reduced_2q_repro.cpp -o reduced_2q_repro.bin -Wl,-rpath,$CUDADIR/lib64: $DPCPP_HOME/build/lib/libsycl.so $CUDADIR/lib64/libcurand.so /usr/lib/x86_64-linux-gnu/libcuda.so $CUDADIR/lib64/libcudart_static.a -lpthread -ldl /usr/lib/x86_64-linux-gnu/librt.so
clang++ -DUSE_SAME_CTX -isystem $DPCPP_HOME/build/include/sycl -isystem $CUDADIR/include -fsycl -fsycl-targets=nvptx64-nvidia-cuda-sycldevice -fsycl-unnamed-lambda reduced_2q_repro.cpp -o reduced_2q_repro_same_ctx.bin -Wl,-rpath,$CUDADIR/lib64: $DPCPP_HOME/build/lib/libsycl.so $CUDADIR/lib64/libcurand.so /usr/lib/x86_64-linux-gnu/libcuda.so $CUDADIR/lib64/libcudart_static.a -lpthread -ldl /usr/lib/x86_64-linux-gnu/librt.so
The first command builds an application which will produce invalid memory access. The second one will produce an application which won't produce invalid memory access. The only difference between these two applications is that the former will use different contexts for two queues. The latter application will use the same context for both queues.
Source code:
#include <CL/sycl.hpp>
#include <CL/sycl/backend/cuda.hpp>
#include <cuda.h>
#ifdef USE_SAME_CTX
# define CONSTRUCT_CTX(name, dev) sycl::context name{dev}
# define CONSTRUCT_Q(name, ctx, dev, ...) sycl::queue name{ctx, dev, __VA_ARGS__}
#else
# define CONSTRUCT_CTX(name, dev)
# define CONSTRUCT_Q(name, ctx, dev, ...) sycl::queue name{dev, __VA_ARGS__}
#endif
void generate(sycl::queue &Q, sycl::buffer<float> &R) {
Q.submit([&](sycl::handler &H) {
auto Acc = R.get_access<sycl::access::mode::read_write>(H);
H.host_task([=](sycl::interop_handle IH) {
auto Ptr = reinterpret_cast<float *>(IH.get_native_mem<sycl::backend::cuda>(Acc));
(void)Ptr;
});
}).wait_and_throw();
}
void test(sycl::device &D) {
std::int64_t n = 1000;
std::vector<float> RHost(n);
std::vector<float> R1Host(n);
CONSTRUCT_CTX(C, D);
try {
CONSTRUCT_Q(Q, C, D, [](sycl::exception_list EL) {
for (auto EP : EL)
std::rethrow_exception(EP);
});
sycl::buffer<float> R{RHost.data(), RHost.size()};
generate(Q, R);
Q.wait_and_throw();
} catch (const std::exception &E) {
std::cout << "Exception: " << E.what() << std::endl;
}
try {
CONSTRUCT_Q(Q, C, D, [](sycl::exception_list EL) {
for (auto EP : EL)
std::rethrow_exception(EP);
});
sycl::buffer<float> R1{R1Host.data(), R1Host.size()};
generate(Q, R1);
Q.wait_and_throw();
} catch (const std::exception &E) {
std::cout << "Exception: " << E.what() << std::endl;
}
}
int main() {
sycl::device D{sycl::gpu_selector{}};
std::cout << "Running on " << D.get_info<sycl::info::device::name>() << std::endl;
for (size_t Idx = 0; Idx < 1; ++Idx) {
std::cout << "Iteration " << Idx << " starts" << std::endl;
test(D);
std::cout << "Iteration " << Idx << " ends" << std::endl;
}
}
reduced_2q_repro.bin
will not fail. Though valgrind will report invalid memory accesses.
Payload of the application (i.e. actions before, after and inside host-task) can be modified in such a way that the application will eventually hang due to the same invalid memory access.
If the for
loop in main
has at least two iterations, the application is killed due to SEGFAULT which is due to the same memory access violation.
Culprit of this invalid memory access is that cuCtxSetCurrent
frees some memory when host-task attempts to wait for own dependencies.
Valgrind's log (with SYCL_PI_TRACE=-1
) is attached for reduced_2q_repro.bin
with only a single iteration of for
loop in main
.
Environment (please complete the following information):
- OS: Linux
- Target device: CUDA
- DPC++ version:
clang version 13.0.0 (git@github.com:intel/llvm.git 63ba1ceab4945c325485cb99b205558de6025f9a)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: ......
- CUDA version: 10.1
Additional context