feat(fault-injection): Add CUDA Driver API interception for PyTorch/vLLM #4963
+613
−4
Overview:
Enhances the CUDA fault injection library to intercept CUDA Driver API calls in addition to the existing Runtime API interception. This is critical because PyTorch and vLLM often bypass the Runtime API (`cuda*` functions) and call Driver API functions (`cu*` functions) directly, causing fault injection to be ineffective.

Details:
Problem: The original CUDA intercept library only hooked Runtime API functions (`cudaGetDeviceCount`, `cudaMalloc`, etc.). When testing fault injection on vLLM pods, faults weren't being triggered because vLLM/PyTorch uses the Driver API directly.

Solution: Added comprehensive Driver API interception:
- `cuInit`, `cuDeviceGetCount`, `cuDeviceGet`
- `cuCtxCreate`, `cuCtxDestroy`, `cuCtxSynchronize`, `cuCtxGetCurrent`
- `cuMemAlloc`, `cuMemFree`, `cuMemcpyDtoH`, `cuMemcpyHtoD`
- `cuLaunchKernel`, `cuModuleLoad`, `cuModuleGetFunction`

The library now intercepts both APIs and returns appropriate error codes based on the configured XID type (79, 48, 94, 95, 43, 74).
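For reviewers unfamiliar with the pattern, here is a minimal sketch of how a hooked Driver API entry point might turn a configured XID into a `CUresult`. It is not taken from `cuda_intercept.c`. The XID-to-error mapping, the helper names (`xid_to_cu_error`, `parse_xid`, `intercept_cuInit`, `fake_driver_init`) and the idea of passing the XID and the real function as parameters are all assumptions made for illustration. The `CUresult` values are copied from `cuda.h` so the sketch compiles without the CUDA toolkit.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins for cuda.h so this compiles without the CUDA toolkit.
 * The numeric values match the CUresult enum in cuda.h. */
typedef int CUresult;
enum {
    CUDA_SUCCESS                    = 0,
    CUDA_ERROR_NO_DEVICE            = 100,
    CUDA_ERROR_ECC_UNCORRECTABLE    = 214,
    CUDA_ERROR_NVLINK_UNCORRECTABLE = 220,
    CUDA_ERROR_LAUNCH_FAILED        = 719,
    CUDA_ERROR_UNKNOWN              = 999
};

/* ASSUMED mapping from XID to Driver API error code; the PR's actual
 * mapping may differ. */
CUresult xid_to_cu_error(int xid) {
    switch (xid) {
    case 79: return CUDA_ERROR_NO_DEVICE;            /* GPU fell off the bus */
    case 48:                                         /* double-bit ECC error */
    case 94:                                         /* contained ECC error */
    case 95: return CUDA_ERROR_ECC_UNCORRECTABLE;    /* uncontained ECC error */
    case 43: return CUDA_ERROR_LAUNCH_FAILED;        /* GPU stopped processing */
    case 74: return CUDA_ERROR_NVLINK_UNCORRECTABLE; /* NVLink error */
    default: return CUDA_ERROR_UNKNOWN;              /* unrecognized XID */
    }
}

/* Parse an XID from a configuration string (for example the value of a
 * hypothetical environment variable). Returns 0, meaning "no fault
 * configured", for NULL, empty, or non-numeric input. */
int parse_xid(const char *value) {
    if (value == NULL || *value == '\0')
        return 0;
    char *end = NULL;
    long v = strtol(value, &end, 10);
    return (*end == '\0' && v > 0 && v <= 1000) ? (int)v : 0;
}

/* Signature shared by cuInit and its stand-in below. */
typedef CUresult (*cuInit_fn)(unsigned int flags);

/* Stand-in for the genuine driver call, used only to exercise the sketch. */
CUresult fake_driver_init(unsigned int flags) {
    (void)flags;
    return CUDA_SUCCESS;
}

/* Core decision of a hooked entry point: fail with the mapped error when a
 * fault is configured, otherwise forward to the real function. */
CUresult intercept_cuInit(unsigned int flags, int xid, cuInit_fn real) {
    if (xid != 0)
        return xid_to_cu_error(xid);  /* fault active: never reach the driver */
    return real ? real(flags) : CUDA_ERROR_UNKNOWN;
}
```

With no XID configured, `intercept_cuInit(0, 0, fake_driver_init)` forwards to the stand-in. With XID 79 it fails before the driver is reached. A real interposer would instead export a symbol named `cuInit` with the exact driver signature, read the configured XID internally, and look up the genuine function at run time (commonly with `dlsym(RTLD_NEXT, "cuInit")`).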
Where should the reviewer start?
`tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/cuda_intercept.c` - The main file with all Driver API interception additions (look at the `cu*` function implementations starting around line 200+)

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Relates to ongoing HW FT initiative