[BUG] Memory free leads to illegal memory access #1561

Open
jglaser opened this issue Jun 7, 2021 · 1 comment
Labels
bug Something isn't working

Comments

@jglaser
Collaborator

jglaser commented Jun 7, 2021

Describe the bug

With 90 x 16 GB workers, query 2 of the NVIDIA GPU BDB benchmark leads to the following log entries and a subsequent crash:

2021-06-06 19:47:27.279|13|info|498234689|||MemoryMonitor about to free memory from tasks|||||
2021-06-06 19:47:27.279|13|info|498234689|||MemoryMonitor successfully freed memory from tasks|||||
2021-06-06 19:49:08.461|13|info|498234689|||MemoryMonitor about to free memory from tasks|||||
2021-06-06 19:49:08.461|13|info|498234689|||MemoryMonitor successfully freed memory from tasks|||||
2021-06-06 19:49:36.834|13|debug|498234689|8|8|Compute Aggregate Kernel tasks created|495373|kernel_id|8||
2021-06-06 19:49:37.514|13|error||||ERROR in BlazingHostTable::get_gpu_table(). What: std::bad_alloc: CUDA error at: /sw/summit/ums/gen119/nvrapids_0.19_gcc_9.3.0/include/rmm/mr/device/managed_memory_resource.hpp:73: cudaErrorIllegalAddress an illegal memory access was encountered|||||
2021-06-06 19:49:37.515|13|error||||ERROR of type rmm::bad_alloc in task::run. What: std::bad_alloc: CUDA error at: /sw/summit/ums/gen119/nvrapids_0.19_gcc_9.3.0/include/rmm/mr/device/managed_memory_resource.hpp:73: cudaErrorIllegalAddress an illegal memory access was encountered|||||
2021-06-06 19:49:37.515|13|error||||ERROR in BlazingHostTable::get_gpu_table(). What: std::bad_alloc: CUDA error at: /sw/summit/ums/gen119/nvrapids_0.19_gcc_9.3.0/include/rmm/mr/device/managed_memory_resource.hpp:73: cudaErrorIllegalAddress an illegal memory access was encountered|||||

Could this be a race condition between a kernel consuming a cache allocation that is concurrently freed by the MemoryMonitor? Is there a lock in place to prevent this?
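For illustration only, the kind of guard being asked about might look like the minimal Python sketch below (hypothetical, not BlazingSQL's actual C++ code; CacheEntry and upload_to_gpu are made-up names): each cache entry holds a lock that both the consumer path and the MemoryMonitor's free path must acquire, so an entry can never be freed while a kernel is still materializing it.

```python
import threading

def upload_to_gpu(host_table):
    # Stand-in for the host-to-device copy done by BlazingHostTable::get_gpu_table().
    return host_table

class CacheEntry:
    """Hypothetical cache entry whose free/consume paths share a lock."""

    def __init__(self, host_table):
        self._host_table = host_table  # host-side data backing the entry
        self._lock = threading.Lock()  # serializes free() against get_gpu_table()

    def get_gpu_table(self):
        # Consumer path: a kernel materializes the cached data on the GPU.
        with self._lock:
            if self._host_table is None:
                raise RuntimeError("cache entry was already freed")
            return upload_to_gpu(self._host_table)

    def free(self):
        # MemoryMonitor path: release the entry under memory pressure.
        with self._lock:
            self._host_table = None
```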

Steps/Code to reproduce bug
Run the GPU BDB benchmark (SF10K) on 16 GB GPUs with --rmm-managed-memory, BLAZING_ALLOCATOR_MODE=existing, and --memory-limit 45GB.
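For reference, a rough single-node approximation of this configuration (the actual run used 90 dask-cuda workers on Summit; this assumes dask-cuda's LocalCUDACluster and BlazingContext's allocator="existing" option, which appear to correspond to the flags above):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from blazingsql import BlazingContext

# --rmm-managed-memory and --memory-limit 45GB as LocalCUDACluster arguments
cluster = LocalCUDACluster(
    rmm_managed_memory=True,
    memory_limit="45GB",
)
client = Client(cluster)

# Assumed equivalent of BLAZING_ALLOCATOR_MODE=existing: reuse the RMM
# allocator already configured by dask-cuda instead of letting BlazingSQL
# set up its own.
bc = BlazingContext(dask_client=client, allocator="existing")

# ... create the GPU BDB (SF10K) tables and run query 2 through bc ...
```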

Expected behavior

No crash

Environment overview (please complete the following information)
ppc64le, CUDA 11, BlazingSQL 0.19

Environment details

Additional context

@jglaser jglaser added bug Something isn't working ? - Needs Triage needs team to review and classify labels Jun 7, 2021
@wmalpica wmalpica removed the ? - Needs Triage needs team to review and classify label Jun 11, 2021
@wmalpica
Contributor

I doubt that the memory monitor is the guilty party here, considering that it freed memory roughly 30 seconds before the illegal memory access. But this is definitely a big problem and I am looking into it.
