feat(fault-injection): Add CUDA Driver API interception for PyTorch/vLLM #4963

nv-oviya · 2025-12-15T18:10:24Z

Overview:

Enhances the CUDA fault injection library to intercept CUDA Driver API calls in addition to the existing Runtime API interception. This is critical because PyTorch and vLLM often bypass the Runtime API (cuda* functions) and call Driver API functions (cu* functions) directly, so faults injected only at the Runtime API layer never reach them.

Details:

Problem: The original CUDA intercept library only hooked Runtime API functions (cudaGetDeviceCount, cudaMalloc, etc.). When testing fault injection on vLLM pods, faults weren't being triggered because vLLM/PyTorch uses the Driver API directly.

Solution: Added comprehensive Driver API interception:

cuInit, cuDeviceGetCount, cuDeviceGet
cuCtxCreate, cuCtxDestroy, cuCtxSynchronize, cuCtxGetCurrent
cuMemAlloc, cuMemFree, cuMemcpyDtoH, cuMemcpyHtoD
cuLaunchKernel, cuModuleLoad, cuModuleGetFunction
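For readers unfamiliar with the technique, here is a minimal sketch of how one of these entry points could be intercepted. It is not the code from cuda_intercept.c. It assumes the library is injected by symbol interposition (for example via LD_PRELOAD), which this description does not state. The `DEMO_INJECT_FAULT` variable and the `demo_*` helpers are hypothetical, and `CUresult` is a local stand-in for the type in cuda.h so the sketch compiles without the CUDA toolkit.

```c
#include <dlfcn.h>
#include <stdlib.h>

/* Stand-ins for cuda.h; the numeric values match the real CUresult enum. */
typedef int CUresult;
enum {
    CUDA_SUCCESS = 0,
    CUDA_ERROR_NOT_INITIALIZED = 3,
    CUDA_ERROR_NO_DEVICE = 100
};

/* Hypothetical switch: inject a fault only when DEMO_INJECT_FAULT=1. */
static int demo_fault_enabled(void) {
    const char *v = getenv("DEMO_INJECT_FAULT");
    return v != NULL && v[0] == '1';
}

/* Look up a symbol in the real driver library. A handle-based dlsym()
 * searches libcuda.so.1 itself, so it does not resolve back to the
 * interposed definition below. Returns NULL if no driver is installed. */
static void *demo_real_symbol(const char *name) {
    static void *driver = NULL;
    if (driver == NULL)
        driver = dlopen("libcuda.so.1", RTLD_LAZY | RTLD_LOCAL);
    return driver != NULL ? dlsym(driver, name) : NULL;
}

/* Same name and signature as the Driver API entry point, so callers bound
 * through the dynamic linker reach this definition first. */
CUresult cuDeviceGetCount(int *count) {
    typedef CUresult (*real_fn_t)(int *);
    real_fn_t real_fn;

    if (demo_fault_enabled()) {
        if (count != NULL)
            *count = 0;
        return CUDA_ERROR_NO_DEVICE; /* simulate a missing GPU */
    }

    real_fn = (real_fn_t)demo_real_symbol("cuDeviceGetCount");
    if (real_fn == NULL)
        return CUDA_ERROR_NOT_INITIALIZED; /* no real driver to forward to */
    return real_fn(count);
}
```

Built as a shared object (for example `gcc -shared -fPIC ... -ldl`), a wrapper like this only catches calls that the dynamic linker binds by name. Callers that obtain driver function pointers some other way are outside what this sketch covers.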

The library now intercepts both APIs and returns appropriate error codes based on the configured XID type (79, 48, 94, 95, 43, 74).
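To illustrate how an XID type might select the injected error, the sketch below maps each listed XID to a Driver API error code. The mapping itself is an assumption made for illustration; the description does not say which code the library returns for each XID. The `demo_*` function names are hypothetical. The numeric error values match the `CUresult` enum in cuda.h, redeclared locally so the sketch compiles without the CUDA toolkit.

```c
#include <stdlib.h>

/* Stand-ins for cuda.h; the numeric values match the real CUresult enum. */
typedef int CUresult;
enum {
    CUDA_SUCCESS = 0,
    CUDA_ERROR_NO_DEVICE = 100,
    CUDA_ERROR_ECC_UNCORRECTABLE = 214,
    CUDA_ERROR_LAUNCH_FAILED = 719,
    CUDA_ERROR_UNKNOWN = 999
};

/* Assumed mapping from XID type to the error an intercepted call returns.
 * XIDs outside the listed set inject nothing. */
static CUresult demo_xid_to_cu_error(int xid) {
    switch (xid) {
    case 79: /* GPU has fallen off the bus */
        return CUDA_ERROR_NO_DEVICE;
    case 48: /* double-bit ECC error */
    case 94: /* contained ECC error */
    case 95: /* uncontained ECC error */
        return CUDA_ERROR_ECC_UNCORRECTABLE;
    case 43: /* GPU stopped processing */
        return CUDA_ERROR_LAUNCH_FAILED;
    case 74: /* NVLink error */
        return CUDA_ERROR_UNKNOWN;
    default:
        return CUDA_SUCCESS;
    }
}

/* Parse an XID from a configuration string. Returns -1 when the string is
 * NULL, empty, or not a plain decimal integer. */
static int demo_parse_xid(const char *text) {
    char *end = NULL;
    long value;

    if (text == NULL || *text == '\0')
        return -1;
    value = strtol(text, &end, 10);
    if (*end != '\0' || value < 0 || value > 1000)
        return -1;
    return (int)value;
}
```

An intercepted function could then call `demo_xid_to_cu_error(demo_parse_xid(...))` on its configured value and forward to the real driver whenever the result is `CUDA_SUCCESS`.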

Where should the reviewer start?

tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/cuda_intercept.c - The main file with all Driver API interception additions (the cu* function implementations start around line 200)

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Relates to the ongoing hardware fault tolerance (HW FT) initiative

copy-pr-bot · 2025-12-15T18:10:27Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

The CUDA intercept library now intercepts both Runtime and Driver APIs:

Runtime API (existing):
- cudaGetDeviceCount, cudaGetDevice, cudaSetDevice
- cudaMalloc, cudaFree, cudaMemcpy
- cudaLaunchKernel, cudaDeviceSynchronize

Driver API (new):
- cuInit, cuDeviceGetCount, cuDeviceGet
- cuCtxCreate, cuCtxDestroy, cuCtxSynchronize
- cuMemAlloc, cuMemFree, cuMemcpyDtoH, cuMemcpyHtoD
- cuLaunchKernel, cuModuleLoad, cuModuleGetFunction

This ensures fault injection works with PyTorch/vLLM which often bypass the Runtime API and call Driver API functions directly.

Signed-off-by: Oviya Seeniraj <oseeniraj@nvidia.com>

pull-request-size bot added the size/XL label Dec 15, 2025

github-actions bot added the feat label Dec 15, 2025

nv-oviya force-pushed the oviya/fault-injection/cuda-driver-intercept branch from aa0eeaf to adb27f8 Compare December 15, 2025 18:12
