[BUG] Memory free leads to illegal memory access #1561

Open
jglaser opened this issue Jun 7, 2021 · 1 comment
Labels
bug Something isn't working

Comments

@jglaser
Collaborator

jglaser commented Jun 7, 2021

Describe the bug

With 90 x 16 GB workers, query 2 of the NVIDIA GPU BDB benchmark leads to the following log entries and a subsequent crash:

2021-06-06 19:47:27.279|13|info|498234689|||MemoryMonitor about to free memory from tasks|||||
2021-06-06 19:47:27.279|13|info|498234689|||MemoryMonitor successfully freed memory from tasks|||||
2021-06-06 19:49:08.461|13|info|498234689|||MemoryMonitor about to free memory from tasks|||||
2021-06-06 19:49:08.461|13|info|498234689|||MemoryMonitor successfully freed memory from tasks|||||
2021-06-06 19:49:36.834|13|debug|498234689|8|8|Compute Aggregate Kernel tasks created|495373|kernel_id|8||
2021-06-06 19:49:37.514|13|error||||ERROR in BlazingHostTable::get_gpu_table(). What: std::bad_alloc: CUDA error at: /sw/summit/ums/gen119/nvrapids_0.19_gcc_9.3.0/include/rmm/mr/device/managed_memory_resource.hpp:73: cudaErrorIllegalAddress an illegal memory access was encountered|||||
2021-06-06 19:49:37.515|13|error||||ERROR of type rmm::bad_alloc in task::run. What: std::bad_alloc: CUDA error at: /sw/summit/ums/gen119/nvrapids_0.19_gcc_9.3.0/include/rmm/mr/device/managed_memory_resource.hpp:73: cudaErrorIllegalAddress an illegal memory access was encountered|||||
2021-06-06 19:49:37.515|13|error||||ERROR in BlazingHostTable::get_gpu_table(). What: std::bad_alloc: CUDA error at: /sw/summit/ums/gen119/nvrapids_0.19_gcc_9.3.0/include/rmm/mr/device/managed_memory_resource.hpp:73: cudaErrorIllegalAddress an illegal memory access was encountered|||||

Could this be a race condition between a kernel consuming a cache allocation that is concurrently freed by the MemoryMonitor? Is there a lock in place to prevent this?
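For illustration only, the kind of guard being asked about might look like the minimal Python sketch below (hypothetical, not BlazingSQL's actual C++ code; CacheEntry and upload_to_gpu are made-up names): each cache entry holds a lock that both the consumer path and the MemoryMonitor's free path must acquire, so an entry can never be freed while a kernel is still materializing it.

```python
import threading

def upload_to_gpu(host_table):
    # Stand-in for the host-to-device copy done by BlazingHostTable::get_gpu_table().
    return host_table

class CacheEntry:
    """Hypothetical cache entry whose free/consume paths share a lock."""

    def __init__(self, host_table):
        self._host_table = host_table  # host-side data backing the entry
        self._lock = threading.Lock()  # serializes free() against get_gpu_table()

    def get_gpu_table(self):
        # Consumer path: a kernel materializes the cached data on the GPU.
        with self._lock:
            if self._host_table is None:
                raise RuntimeError("cache entry was already freed")
            return upload_to_gpu(self._host_table)

    def free(self):
        # MemoryMonitor path: release the entry under memory pressure.
        with self._lock:
            self._host_table = None
```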

Steps/Code to reproduce bug
Run the GPU BDB benchmark (SF10K) on 16 GB GPUs with --rmm-managed-memory, BLAZING_ALLOCATOR_MODE=existing, and --memory-limit 45GB.
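For reference, a rough single-node approximation of this configuration (the actual run used 90 dask-cuda workers on Summit; this assumes dask-cuda's LocalCUDACluster and BlazingContext's allocator="existing" option, which appear to correspond to the flags above):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from blazingsql import BlazingContext

# --rmm-managed-memory and --memory-limit 45GB as LocalCUDACluster arguments
cluster = LocalCUDACluster(
    rmm_managed_memory=True,
    memory_limit="45GB",
)
client = Client(cluster)

# Assumed equivalent of BLAZING_ALLOCATOR_MODE=existing: reuse the RMM
# allocator already configured by dask-cuda instead of letting BlazingSQL
# set up its own.
bc = BlazingContext(dask_client=client, allocator="existing")

# ... create the GPU BDB (SF10K) tables and run query 2 through bc ...
```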

Expected behavior

No crash

Environment overview (please complete the following information)
ppc64le, CUDA 11, BlazingSQL 0.19

Environment details

Additional context

@jglaser jglaser added bug Something isn't working ? - Needs Triage needs team to review and classify labels Jun 7, 2021
@wmalpica wmalpica removed the ? - Needs Triage needs team to review and classify label Jun 11, 2021
@wmalpica
Contributor

I doubt that the memory monitor is the guilty party here, considering that it freed memory roughly 30 seconds before the illegal memory access. But this is definitely a big problem and I am looking into it.
