Add support for Host-backed GPU maps #522
base: master
Conversation
- Introduced two new map types: `BPF_MAP_TYPE_PERGPUTD_ARRAY_HOST_MAP` and `BPF_MAP_TYPE_GPU_ARRAY_HOST_MAP` for Tegra platforms without CUDA IPC.
- Updated `default_trampoline.cu` to handle the new host-based map types in the BPF helper functions.
- Created `host_map_test.bpf.c` and `host_map_test.c` to demonstrate usage of the new host-backed maps, including per-thread and shared storage.
- Enhanced the build system with a Makefile and README for the new example, detailing usage and requirements.

This change improves memory management and flexibility for applications running on platforms lacking CUDA IPC support, enabling efficient data sharing between CPU and GPU.

Test Method and Cases

Prerequisites

Test Case 1: Basic Functionality

Purpose: Verify both map types work correctly

Expected Result:
- `shared_counter` shows values for keys 0-9 (thread IDs mod 10)
- `perthread_counter` shows per-thread call counts, execution times, and thread IDs
- `thread_timestamp` shows the active thread count

Signed-off-by: jingxuanxie <jingxuanxie@deeproute.ai>
Thanks!
Pull request overview
This PR adds support for host-backed GPU map types designed for Tegra platforms that lack CUDA IPC support. The implementation introduces two new map types that store data in host memory (accessible via boost::interprocess shared memory + cudaHostRegister) rather than GPU device memory (cuMemAlloc + CUDA IPC), enabling efficient CPU-GPU data sharing on platforms without IPC capabilities.
Key Changes:
- Added `BPF_MAP_TYPE_PERGPUTD_ARRAY_HOST_MAP` (1512) for per-GPU-thread storage and `BPF_MAP_TYPE_GPU_ARRAY_HOST_MAP` (1513) for shared storage, both backed by host memory
- Implemented memory synchronization using `std::atomic_thread_fence` on the CPU side and `membar.sys` on the GPU side for proper visibility guarantees
- Included a comprehensive test example with a BPF program, a userspace monitor, and a CUDA application
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| runtime/include/bpftime_shm.hpp | Added new map type enumerations for host-backed GPU maps (1512, 1513) with clear documentation distinguishing IPC vs host-backed implementations |
| runtime/src/handler/map_handler.cpp | Extended map handler with lookup, update, delete, and iteration support for both new host-backed map types, including userspace value size calculations |
| runtime/src/bpf_map/gpu/nv_gpu_array_host_map.{hpp,cpp} | Implemented per-thread host-backed map with boost::interprocess shared memory and proper memory barriers for CPU-GPU synchronization |
| runtime/src/bpf_map/gpu/nv_gpu_shared_array_host_map.{hpp,cpp} | Implemented shared host-backed map with single-copy storage accessible by all GPU threads |
| runtime/src/bpf_map/gpu/nv_gpu_{shared_array,per_thread_array,ringbuf}_map.cpp | Added memory barriers to existing GPU maps for consistency with new host-backed implementations |
| runtime/CMakeLists.txt | Added new source files to build system with clear comments distinguishing IPC-based vs host-based implementations |
| attach/nv_attach_impl/trampoline/default_trampoline.cu | Extended BPF helper functions (map_lookup_elem, map_update_elem) to handle host-backed map types with appropriate memory barriers |
| example/gpu/host_map_test/host_map_test.bpf.c | BPF program demonstrating per-thread and shared host-backed maps with kprobe/kretprobe on CUDA kernel |
| example/gpu/host_map_test/host_map_test.c | Userspace program that reads and displays statistics from host-backed maps using dynamic key iteration |
| example/gpu/host_map_test/vec_add.cu | Simple CUDA vector addition application for triggering BPF probes and testing map functionality |
| example/gpu/host_map_test/Makefile | Build system for example with support for libbpf, bpftool, and CUDA compilation |
| example/gpu/host_map_test/README.md | Comprehensive documentation covering map types, use cases, building, running, and troubleshooting |
Comments suppressed due to low confidence (4)
attach/nv_attach_impl/trampoline/default_trampoline.cu:224

`real_key` is used without bounds checking to compute a per-thread offset, which can cause an out-of-bounds read/write in host memory. A malicious or buggy BPF program can pass a large key to this helper and obtain a pointer past the allocated map region, leading to host memory corruption when used. Add an explicit check against `map_info.max_entries` (and reject/return 0 on failure) before computing the offset:

```cpp
auto real_key = *(uint32_t *)(uintptr_t)key;
if ((uint64_t)real_key >= (uint64_t)map_info.max_entries) {
	return 0; // or error code
}
auto offset = array_map_offset(real_key, map_info, map);
```
attach/nv_attach_impl/trampoline/default_trampoline.cu:276

`real_key` is used to compute a destination pointer without validating it against `map_info.max_entries`, enabling out-of-bounds writes to host memory. An attacker controlling the map key can write past the allocated buffer via `simple_memcpy`, corrupting adjacent host memory. Validate the key before computing the offset or performing the copy:

```cpp
auto real_key = *(uint32_t *)(uintptr_t)key;
if ((uint64_t)real_key >= (uint64_t)map_info.max_entries) {
	return (uint64_t)-1; // or appropriate error handling
}
auto offset = array_map_offset(real_key, map_info, map);
```
attach/nv_attach_impl/trampoline/default_trampoline.cu:233

Out-of-bounds access risk: `real_key` is used directly in `base + real_key * map_info.value_size` with no check against `map_info.max_entries`. A crafted key can cause the helper to return a pointer outside the map buffer, leading to OOB reads/writes by callers. Add a bounds check before the pointer arithmetic and return 0 on invalid keys:

```cpp
auto real_key = *(uint32_t *)(uintptr_t)key;
if ((uint64_t)real_key >= (uint64_t)map_info.max_entries) {
	return 0;
}
auto base = (char *)map_info.extra_buffer;
return (uint64_t)(uintptr_t)(base + (uint64_t)real_key * map_info.value_size);
```
attach/nv_attach_impl/trampoline/default_trampoline.cu:286

The destination pointer `dst` is derived from `base + real_key * map_info.value_size` without validating `real_key`, allowing out-of-bounds writes into host memory. An attacker can pass an oversized key to corrupt memory adjacent to the map buffer via `simple_memcpy`. Guard this path by checking the key against `map_info.max_entries` before computing `dst`:

```cpp
auto real_key = *(uint32_t *)(uintptr_t)key;
if ((uint64_t)real_key >= (uint64_t)map_info.max_entries) {
	return (uint64_t)-1; // or set an error result
}
auto base = (char *)map_info.extra_buffer;
auto dst = (void *)(uintptr_t)(base + (uint64_t)real_key * map_info.value_size);
```
Does this implementation also work on general Tesla/GeForce platforms, or is it specific to Tegra? If it is specific, I guess it would be better to distinguish them in the compilation configuration.
This is a universal solution that can be used on platforms that do not support CUDA IPC. For platforms that do support CUDA IPC, `BPF_MAP_TYPE_PERGPUTD_ARRAY_MAP` and `BPF_MAP_TYPE_GPU_ARRAY_MAP` are preferred, as they allocate memory on the device, enabling faster GPU access.
- Add `HOST_MAP_MAX_ENTRIES` variable with a default value of 10
- Pass `HOST_MAP_MAX_ENTRIES` to both the BPF and C compilation commands
- Enable the documented feature allowing users to customize map entries via `make HOST_MAP_MAX_ENTRIES=N`

Signed-off-by: jingxuanxie <jingxuanxie@deeproute.ai>
Force-pushed from 149142a to 0e34818.
@Officeyutong Could you please review this PR and help assess whether it can be merged?
Please resolve the merge conflicts; after that, I will review this PR.
```cpp
if (did_switch_ctx) {
	cuCtxSetCurrent(prev_ctx);
}
// Memory barrier: ensure GPU data is visible to CPU
```
Do we need this?
Since both the CPU and GPU access the same variable, and some memory models are only weakly consistent, I think a memory barrier operation is needed to be on the safe side.
Why change so much of the trampoline?
Because I modified default_trampoline.cu, I needed to recompile trampoline_ptx.h. I used clang 18, so the diff looks quite large.