@nv-oviya
Contributor

Overview:

Extends the CUDA fault injection library to intercept CUDA Driver API calls in addition to the existing Runtime API interception. This matters because PyTorch and vLLM often bypass the Runtime API (cuda* functions) and call Driver API functions (cu* functions) directly, so faults injected only at the Runtime API layer never reach them.

Details:

Problem: The original CUDA intercept library only hooked Runtime API functions (cudaGetDeviceCount, cudaMalloc, etc.). When testing fault injection on vLLM pods, faults were not triggered because vLLM and PyTorch call the Driver API directly.

Solution: Added comprehensive Driver API interception:

  • cuInit, cuDeviceGetCount, cuDeviceGet
  • cuCtxCreate, cuCtxDestroy, cuCtxSynchronize, cuCtxGetCurrent
  • cuMemAlloc, cuMemFree, cuMemcpyDtoH, cuMemcpyHtoD
  • cuLaunchKernel, cuModuleLoad, cuModuleGetFunction
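
For readers unfamiliar with the technique, here is a minimal sketch of how one Driver API entry point could be intercepted from a preloaded shared library. It is not the code in cuda_intercept.c: the environment variable `HYPOTHETICAL_FAULT_ENABLED`, the helper names, and the choice to resolve the real function through `dlopen("libcuda.so.1")` are all assumptions made for illustration, and `CUresult` is a local stand-in so the sketch compiles without cuda.h.

```c
#include <dlfcn.h>
#include <stdlib.h>

/* Local stand-ins so the sketch builds without cuda.h.
 * The numeric values match the CUresult enum in cuda.h. */
typedef int CUresult;
enum {
    CUDA_SUCCESS = 0,
    CUDA_ERROR_NO_DEVICE = 100,
    CUDA_ERROR_UNKNOWN = 999
};

typedef CUresult (*cuDeviceGetCount_fn)(int *);

/* Look up the real driver function. Searching by handle only consults
 * libcuda and its dependencies, so it does not find this wrapper. */
static cuDeviceGetCount_fn real_cuDeviceGetCount(void) {
    static void *handle = NULL;
    if (handle == NULL) {
        handle = dlopen("libcuda.so.1", RTLD_LAZY);
    }
    if (handle == NULL) {
        return NULL;
    }
    return (cuDeviceGetCount_fn)dlsym(handle, "cuDeviceGetCount");
}

/* Hypothetical switch: the real library's configuration is not shown here. */
static int fault_enabled(void) {
    const char *v = getenv("HYPOTHETICAL_FAULT_ENABLED");
    return v != NULL && v[0] == '1';
}

/* Same name as the driver symbol, so a preloaded library shadows it. */
CUresult cuDeviceGetCount(int *count) {
    if (fault_enabled()) {
        if (count != NULL) {
            *count = 0;
        }
        return CUDA_ERROR_NO_DEVICE; /* simulate a missing GPU */
    }
    cuDeviceGetCount_fn real = real_cuDeviceGetCount();
    /* Forward to the driver when no fault is configured. */
    return real != NULL ? real(count) : CUDA_ERROR_UNKNOWN;
}
```

Built as a shared object and loaded with `LD_PRELOAD`, a wrapper like this shadows the driver symbol for callers that resolve it through the normal dynamic linker search order. An `RTLD_NEXT` lookup is a common alternative to the explicit `dlopen` used here.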

The library now intercepts both APIs and returns appropriate error codes based on the configured XID type (79, 48, 94, 95, 43, 74).
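
As an illustration of what "appropriate error codes" could mean, the sketch below maps each listed XID to a Driver API result code. The mapping itself is an assumption, not taken from cuda_intercept.c, and the function name `xid_to_curesult` is hypothetical. The XID descriptions in the comments follow NVIDIA's published XID list, and the numeric values match the CUresult enum in cuda.h.

```c
/* Local stand-ins so the sketch builds without cuda.h. */
typedef int CUresult;
enum {
    CUDA_SUCCESS = 0,
    CUDA_ERROR_NO_DEVICE = 100,
    CUDA_ERROR_ECC_UNCORRECTABLE = 214,
    CUDA_ERROR_LAUNCH_FAILED = 719,
    CUDA_ERROR_UNKNOWN = 999
};

/* Hypothetical mapping from a configured XID to the error a hooked
 * cu* call would return. Unlisted XIDs inject no fault. */
static CUresult xid_to_curesult(int xid) {
    switch (xid) {
    case 79: /* GPU has fallen off the bus */
        return CUDA_ERROR_NO_DEVICE;
    case 48: /* double-bit ECC error */
    case 94: /* contained ECC error */
    case 95: /* uncontained ECC error */
        return CUDA_ERROR_ECC_UNCORRECTABLE;
    case 43: /* GPU stopped processing */
        return CUDA_ERROR_LAUNCH_FAILED;
    case 74: /* NVLink error */
        return CUDA_ERROR_UNKNOWN;
    default:
        return CUDA_SUCCESS;
    }
}
```

Keeping the mapping in one function means every hooked call, Runtime or Driver, can consult the same table, so adding an XID is a one-line change.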

Where should the reviewer start?

  • tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/cuda_intercept.c - The main file, containing all the Driver API interception additions (see the cu* function implementations starting around line 200)

Related Issues:

Relates to the ongoing hardware fault tolerance (HW FT) initiative

@copy-pr-bot

copy-pr-bot bot commented Dec 15, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

The CUDA intercept library now intercepts both Runtime and Driver APIs:

Runtime API (existing):
  - cudaGetDeviceCount, cudaGetDevice, cudaSetDevice
  - cudaMalloc, cudaFree, cudaMemcpy
  - cudaLaunchKernel, cudaDeviceSynchronize

Driver API (new):
  - cuInit, cuDeviceGetCount, cuDeviceGet
  - cuCtxCreate, cuCtxDestroy, cuCtxSynchronize
  - cuMemAlloc, cuMemFree, cuMemcpyDtoH, cuMemcpyHtoD
  - cuLaunchKernel, cuModuleLoad, cuModuleGetFunction

This ensures fault injection works with PyTorch/vLLM, which often
bypass the Runtime API and call Driver API functions directly.

Signed-off-by: Oviya Seeniraj <oseeniraj@nvidia.com>
@nv-oviya force-pushed the oviya/fault-injection/cuda-driver-intercept branch from aa0eeaf to adb27f8 on December 15, 2025 18:12