Conversation

@dr-xjx (Contributor) commented Dec 1, 2025

  • Introduced two new map types: BPF_MAP_TYPE_PERGPUTD_ARRAY_HOST_MAP and BPF_MAP_TYPE_GPU_ARRAY_HOST_MAP for Tegra platforms without CUDA IPC.
  • Updated default_trampoline.cu to handle new host-based map types in the BPF helper functions.
  • Created host_map_test.bpf.c and host_map_test.c to demonstrate usage of the new host-backed maps, including per-thread and shared storage.
  • Enhanced the build system with a Makefile and README for the new example, detailing usage and requirements.

This change improves memory management and flexibility for applications running on platforms lacking CUDA IPC support, enabling efficient data sharing between CPU and GPU.

Test Method and Cases
Prerequisites
Test Case 1: Basic Functionality
Purpose: Verify both map types work correctly
Expected Result:
  • shared_counter shows values for keys 0-9 (threads mod 10)
  • perthread_counter shows per-thread call counts, execution times, and thread IDs
  • thread_timestamp shows active thread count

Signed-off-by: jingxuanxie <jingxuanxie@deeproute.ai>
@yunwei37 (Member) commented Dec 1, 2025

Thanks!

Copilot AI (Contributor) left a comment

Pull request overview

This PR adds support for host-backed GPU map types designed for Tegra platforms that lack CUDA IPC support. The implementation introduces two new map types that store data in host memory (accessible via boost::interprocess shared memory + cudaHostRegister) rather than GPU device memory (cuMemAlloc + CUDA IPC), enabling efficient CPU-GPU data sharing on platforms without IPC capabilities.
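
To make the mechanism concrete, here is a minimal sketch of how such a host-backed region could be set up, assuming boost::interprocess and the CUDA runtime API; the host_backed_region class and its names are illustrative, not the PR's actual implementation:

```cuda
// Hypothetical sketch, not the PR's actual classes: create a named
// shared-memory segment and pin it so the GPU can access it directly.
#include <boost/interprocess/shared_memory_object.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <cuda_runtime.h>
#include <cstddef>

struct host_backed_region {
    boost::interprocess::shared_memory_object shm;
    boost::interprocess::mapped_region region;

    host_backed_region(const char *name, std::size_t bytes)
        : shm(boost::interprocess::open_or_create, name,
              boost::interprocess::read_write)
    {
        shm.truncate(bytes);
        region = boost::interprocess::mapped_region(
            shm, boost::interprocess::read_write);
        // Pin the shared pages and map them into the GPU's address space;
        // this replaces the cuMemAlloc + CUDA IPC path on platforms
        // (e.g. Tegra) that lack IPC support.
        cudaHostRegister(region.get_address(), bytes,
                         cudaHostRegisterMapped);
    }
    ~host_backed_region()
    {
        cudaHostUnregister(region.get_address());
    }
    void *data() const { return region.get_address(); }
};
```

Both the userspace runtime and the GPU-side trampoline code can then address the same bytes, which is why the CPU/GPU barrier pairing noted below matters.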

Key Changes:

  • Added BPF_MAP_TYPE_PERGPUTD_ARRAY_HOST_MAP (1512) for per-GPU-thread storage and BPF_MAP_TYPE_GPU_ARRAY_HOST_MAP (1513) for shared storage, both backed by host memory
  • Implemented memory synchronization using std::atomic_thread_fence on CPU side and membar.sys on GPU side for proper visibility guarantees
  • Included comprehensive test example with BPF program, userspace monitor, and CUDA application

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

Summary per file:
  • runtime/include/bpftime_shm.hpp: Added new map type enumerations for host-backed GPU maps (1512, 1513) with clear documentation distinguishing IPC vs host-backed implementations
  • runtime/src/handler/map_handler.cpp: Extended the map handler with lookup, update, delete, and iteration support for both new host-backed map types, including userspace value size calculations
  • runtime/src/bpf_map/gpu/nv_gpu_array_host_map.{hpp,cpp}: Implemented the per-thread host-backed map with boost::interprocess shared memory and proper memory barriers for CPU-GPU synchronization
  • runtime/src/bpf_map/gpu/nv_gpu_shared_array_host_map.{hpp,cpp}: Implemented the shared host-backed map with single-copy storage accessible by all GPU threads
  • runtime/src/bpf_map/gpu/nv_gpu_{shared_array,per_thread_array,ringbuf}_map.cpp: Added memory barriers to existing GPU maps for consistency with the new host-backed implementations
  • runtime/CMakeLists.txt: Added new source files to the build system with comments distinguishing IPC-based vs host-based implementations
  • attach/nv_attach_impl/trampoline/default_trampoline.cu: Extended BPF helper functions (map_lookup_elem, map_update_elem) to handle host-backed map types with appropriate memory barriers
  • example/gpu/host_map_test/host_map_test.bpf.c: BPF program demonstrating per-thread and shared host-backed maps with a kprobe/kretprobe on a CUDA kernel
  • example/gpu/host_map_test/host_map_test.c: Userspace program that reads and displays statistics from the host-backed maps using dynamic key iteration
  • example/gpu/host_map_test/vec_add.cu: Simple CUDA vector-addition application for triggering the BPF probes and testing map functionality
  • example/gpu/host_map_test/Makefile: Build system for the example, with support for libbpf, bpftool, and CUDA compilation
  • example/gpu/host_map_test/README.md: Comprehensive documentation covering map types, use cases, building, running, and troubleshooting
Comments suppressed due to low confidence (4)

attach/nv_attach_impl/trampoline/default_trampoline.cu:224

  • real_key is used without bounds checking to compute a per-thread offset, which can cause an out-of-bounds read/write in host memory. A malicious or buggy BPF program can pass a large key to this helper and obtain a pointer past the allocated map region, leading to host memory corruption when used. Add an explicit check against map_info.max_entries (and reject/return 0 on failure) before computing the offset:

```cuda
auto real_key = *(uint32_t *)(uintptr_t)key;
if ((uint64_t)real_key >= (uint64_t)map_info.max_entries) {
    return 0; // or error code
}
auto offset = array_map_offset(real_key, map_info, map);
```

attach/nv_attach_impl/trampoline/default_trampoline.cu:276

  • real_key is used to compute a destination pointer without validating it against map_info.max_entries, enabling out-of-bounds writes to host memory. An attacker controlling the map key can write past the allocated buffer via simple_memcpy, corrupting adjacent host memory. Validate the key before computing the offset or performing the copy:

```cuda
auto real_key = *(uint32_t *)(uintptr_t)key;
if ((uint64_t)real_key >= (uint64_t)map_info.max_entries) {
    return (uint64_t)-1; // or appropriate error handling
}
auto offset = array_map_offset(real_key, map_info, map);
```

attach/nv_attach_impl/trampoline/default_trampoline.cu:233

  • Out-of-bounds access risk: real_key is used directly in base + real_key * map_info.value_size with no check against map_info.max_entries. A crafted key can cause the helper to return a pointer outside the map buffer, leading to OOB reads/writes by callers. Add a bounds check before the pointer arithmetic and return 0 on invalid keys:

```cuda
auto real_key = *(uint32_t *)(uintptr_t)key;
if ((uint64_t)real_key >= (uint64_t)map_info.max_entries) {
    return 0;
}
auto base = (char *)map_info.extra_buffer;
return (uint64_t)(uintptr_t)(base + (uint64_t)real_key * map_info.value_size);
```

attach/nv_attach_impl/trampoline/default_trampoline.cu:286

  • The destination pointer dst is derived from base + real_key * map_info.value_size without validating real_key, allowing out-of-bounds writes into host memory. An attacker can pass an oversized key to corrupt memory adjacent to the map buffer via simple_memcpy. Guard this path by checking the key against map_info.max_entries before computing dst:

```cuda
auto real_key = *(uint32_t *)(uintptr_t)key;
if ((uint64_t)real_key >= (uint64_t)map_info.max_entries) {
    return (uint64_t)-1; // or set an error result
}
auto base = (char *)map_info.extra_buffer;
auto dst = (void *)(uintptr_t)(base + (uint64_t)real_key * map_info.value_size);
```


@Forsworns (Contributor) commented:

Does this implementation also work on general Tesla/GeForce platforms, or is it specific to Tegra? If so, I guess it would be better to distinguish them in the compilation configuration.

@dr-xjx (Contributor Author) commented Dec 2, 2025

> Does this implementation also work on general Tesla/GeForce platforms, or is it specific to Tegra? If so, I guess it would be better to distinguish them in the compilation configuration.

This is a universal solution that can be used on platforms that do not support CUDA IPC. For platforms that support CUDA IPC, BPF_MAP_TYPE_PERGPUTD_ARRAY_MAP and BPF_MAP_TYPE_GPU_ARRAY_MAP are preferred as they allocate memory on the device, enabling faster GPU access.
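
For context, a simplified contrast with the device-backed path mentioned here (illustrative only; assumes an initialized CUDA driver context, and alloc_device_backed is a hypothetical helper, not code from this PR):

```cuda
#include <cuda.h>
#include <cstddef>

// Device-backed path (IPC-capable platforms): memory lives on the GPU and
// a CUDA IPC handle lets other processes map the same allocation.
CUdeviceptr alloc_device_backed(std::size_t bytes, CUipcMemHandle *out_handle)
{
    CUdeviceptr dptr = 0;
    cuMemAlloc(&dptr, bytes);            // allocate in device memory
    cuIpcGetMemHandle(out_handle, dptr); // export handle to other processes
    return dptr;
}
```

On platforms where CUDA IPC is not supported, the handle export step is unavailable, which is what the host-backed maps work around, at the cost of the GPU accessing host memory.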

- Add HOST_MAP_MAX_ENTRIES variable with default value 10
- Pass HOST_MAP_MAX_ENTRIES to both BPF and C compilation commands
- Enable the documented feature allowing users to customize map entries
  via 'make HOST_MAP_MAX_ENTRIES=N'

Signed-off-by: jingxuanxie <jingxuanxie@deeproute.ai>
@dr-xjx force-pushed the feature/host-backed-gpu-maps branch from 149142a to 0e34818 on December 2, 2025 at 03:56
@dr-xjx (Contributor Author) commented Dec 3, 2025

@Officeyutong Could you please review this PR and help assess whether it can be merged?

@Officeyutong (Contributor) commented:

> @Officeyutong Could you please review this PR and help assess whether it can be merged?

Please resolve the merge conflicts; after that I will review this PR.

```cuda
if (did_switch_ctx) {
    cuCtxSetCurrent(prev_ctx);
}
// Memory barrier: ensure GPU data is visible to CPU
```
A Member commented:

Do we need this?

dr-xjx (Contributor Author) replied:

When both the CPU and the GPU access the same variable, some memory models are only weakly consistent, so to be on the safe side I think a memory barrier operation is needed.
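
As a minimal sketch of the pairing being described (assumed for illustration, not the exact code in this PR):

```cuda
#include <atomic>

// CPU side: after writing a value into the pinned host buffer, issue a
// fence so the stores are ordered before the GPU observes the update.
void cpu_publish_value()
{
    /* ... write into the host-backed map slot ... */
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

// GPU side: a system-scope barrier (membar.sys) orders the device's
// accesses to host memory with respect to the CPU.
__device__ void gpu_read_value()
{
    asm volatile("membar.sys;" ::: "memory");
    /* ... read the host-backed map slot ... */
}
```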

A Member commented:

Why are there so many trampoline changes?

dr-xjx (Contributor Author) replied:

Because I modified default_trampoline.cu, I needed to regenerate trampoline_ptx.h. I used clang 18, so the diff looks quite large.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants