Describe the bug
With 90x16GB workers, query 2 of the NVIDIA GPU BDB benchmark leads to the following log entries and a subsequent crash:
2021-06-06 19:47:27.279|13|info|498234689|||MemoryMonitor about to free memory from tasks|||||
2021-06-06 19:47:27.279|13|info|498234689|||MemoryMonitor successfully freed memory from tasks|||||
2021-06-06 19:49:08.461|13|info|498234689|||MemoryMonitor about to free memory from tasks|||||
2021-06-06 19:49:08.461|13|info|498234689|||MemoryMonitor successfully freed memory from tasks|||||
2021-06-06 19:49:36.834|13|debug|498234689|8|8|Compute Aggregate Kernel tasks created|495373|kernel_id|8||
2021-06-06 19:49:37.514|13|error||||ERROR in BlazingHostTable::get_gpu_table(). What: std::bad_alloc: CUDA error at: /sw/summit/ums/gen119/nvrapids_0.19_gcc_9.3.0/include/rmm/mr/device/managed_memory_resource.hpp:73: cudaErrorIllegalAddress an illegal memory access was encountered|||||
2021-06-06 19:49:37.515|13|error||||ERROR of type rmm::bad_alloc in task::run. What: std::bad_alloc: CUDA error at: /sw/summit/ums/gen119/nvrapids_0.19_gcc_9.3.0/include/rmm/mr/device/managed_memory_resource.hpp:73: cudaErrorIllegalAddress an illegal memory access was encountered|||||
2021-06-06 19:49:37.515|13|error||||ERROR in BlazingHostTable::get_gpu_table(). What: std::bad_alloc: CUDA error at: /sw/summit/ums/gen119/nvrapids_0.19_gcc_9.3.0/include/rmm/mr/device/managed_memory_resource.hpp:73: cudaErrorIllegalAddress an illegal memory access was encountered|||||
Could this be a race condition between a kernel consuming a cache allocation and the MemoryMonitor freeing that same allocation? Is there a lock in place to prevent this?
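(To make the question concrete, here is a minimal, purely illustrative Python sketch of the pattern I am asking about. The HostCache class and its method names are invented for this example and are not BlazingSQL's actual C++ classes; the point is only that the consumer path and the memory monitor would need to hold the same lock.)

```python
import threading

# Hypothetical sketch: a host-side cache whose entries can be evicted by a
# background memory monitor while a consumer is materializing them on the GPU.
# Taking an entry and evicting entries both happen under one lock, so the
# monitor can never free a buffer that a consumer has already claimed.
class HostCache:
    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}  # key -> host buffer

    def put(self, key, buf):
        with self._lock:
            self._entries[key] = buf

    def take_for_gpu(self, key):
        # Consumer path (loosely analogous to a kernel calling get_gpu_table()):
        # the entry is removed from the cache under the lock, so eviction can no
        # longer free it out from under the device copy that follows.
        with self._lock:
            return self._entries.pop(key, None)

    def evict_all(self):
        # Memory-monitor path: serialized against take_for_gpu() by the same lock.
        with self._lock:
            freed = list(self._entries.values())
            self._entries.clear()
        return freed
```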
Steps/Code to reproduce bug
Run the GPU BDB benchmark (SF10K) on 16 GB GPUs with --rmm-managed-memory, BLAZING_ALLOCATOR_MODE=existing, and --memory-limit 45GB.
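(For reference, a rough single-node sketch of how I understand these settings to map onto the Python APIs; the actual run used 90 dask-cuda workers launched via the CLI on Summit, and allocator="existing" is my reading of what BLAZING_ALLOCATOR_MODE=existing selects.)

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from blazingsql import BlazingContext

# Local stand-in for the 90-worker cluster; one worker per visible GPU.
cluster = LocalCUDACluster(
    rmm_managed_memory=True,  # --rmm-managed-memory
    memory_limit="45GB",      # --memory-limit 45GB (host memory per worker)
)
client = Client(cluster)

# Reuse the RMM allocator that dask-cuda already set up rather than
# letting BlazingSQL create its own pool.
bc = BlazingContext(dask_client=client, allocator="existing")
```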
Expected behavior
No crash
Environment overview (please complete the following information)
ppc64le, CUDA 11, BlazingSQL 0.19
Environment details
Additional context
I doubt that the memory monitor is the guilty party here, considering that it freed something about 30 seconds before the illegal memory access. But this is definitely a big problem, and I am looking into it.