Problem
hipRTC-compiled kernels pay full compilation cost on every program run:
- Clang subprocess (Stage 0): HIP C++ → SPIR-V (~3–5s per kernel)
- SPIR-V translation (Stage 1): SPIR-V → LLVM IR via llvm-spirv
- Kernel codegen (Stage 2+3): LLVM optimization + codegen + linking → kernel.so
chipStar's module cache (~/.cache/chipStar/) cannot help because the SPIR-V
is non-deterministic across runs due to LLVM's internal non-determinism
(hash table iteration order, value naming counters — see
llvm/llvm-project#123791). The content-hash cache
key changes every run even for identical source.
A companion bug report #1142 suggests eliminating path-related non-determinism and disabling the
write-only cache to prevent unbounded disk growth. This feature request is
about actually making caching work for hipRTC.
Approaches
A. RTC-level cache in spirv_hiprtc.cc (recommended)
Cache the hipcc output (Clang offload bundle) keyed by the RTC inputs:
cache_key = hash(source_string + header_contents + compile_options
+ hipcc_version + chipStar_build_id)
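A minimal sketch of how such a key could be computed, assuming the inputs are gathered into a hypothetical RtcInputs struct. The names RtcInputs, Fnv1a64 and rtcCacheKey are illustrative and do not exist in chipStar. FNV-1a is used only to keep the example dependency-free; a real implementation would probably want a collision-resistant digest. Each field is length-prefixed so that moving bytes between fields (for example from the source into a header) changes the key.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container for everything that influences the hipcc output.
struct RtcInputs {
  std::string source;
  // (include name, contents) pairs as passed to hiprtcCreateProgram().
  std::vector<std::pair<std::string, std::string>> headers;
  std::vector<std::string> options;
  std::string hipccVersion;     // assumption: some version string or binary hash
  std::string chipStarBuildId;  // assumption: some build identifier
};

// 64-bit FNV-1a; a stand-in for a stronger digest, not a recommendation.
class Fnv1a64 {
public:
  // Mixes in the length first so that field boundaries are unambiguous.
  void update(const std::string &s) {
    std::uint64_t n = s.size();
    for (int i = 0; i < 8; ++i)
      mixByte(static_cast<unsigned char>(n >> (8 * i)));
    for (unsigned char c : s)
      mixByte(c);
  }
  std::uint64_t digest() const { return state_; }

private:
  void mixByte(unsigned char b) {
    state_ ^= b;
    state_ *= 0x100000001b3ULL;  // FNV prime
  }
  std::uint64_t state_ = 0xcbf29ce484222325ULL;  // FNV offset basis
};

// Returns a 16-character lowercase hex key derived only from the RTC inputs.
std::string rtcCacheKey(const RtcInputs &in) {
  Fnv1a64 h;
  h.update(in.source);
  h.update(std::to_string(in.headers.size()));
  for (const auto &hdr : in.headers) {
    h.update(hdr.first);
    h.update(hdr.second);
  }
  h.update(std::to_string(in.options.size()));
  for (const auto &opt : in.options)
    h.update(opt);
  h.update(in.hipccVersion);
  h.update(in.chipStarBuildId);

  static const char kDigits[] = "0123456789abcdef";
  std::uint64_t d = h.digest();
  std::string out(16, '0');
  for (int i = 15; i >= 0; --i) {
    out[static_cast<std::size_t>(i)] = kDigits[d & 0xF];
    d >>= 4;
  }
  return out;
}
```

Because no compiler output enters the key, two runs with identical inputs produce the same key regardless of LLVM's internal ordering.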
Pros:
- Skips the entire Clang subprocess — the most expensive stage
- Clean architecture: sits entirely in the RTC layer
- Source + headers + options are all available in compile()
- Deterministic by definition (no LLVM output in the key)
Cons:
- Must include all headers passed via hiprtcCreateProgram() in the hash
- Does NOT capture changes to headers resolved from -I paths on the
filesystem (the user may #include "foo.h" where foo.h comes from a
-I path, not from hiprtcCreateProgram). If such a header changes, the
cache would serve stale compiled code.
- Requires a cache invalidation strategy for compiler upgrades
Mitigation for filesystem header staleness:
- Document that the cache only tracks headers explicitly provided via
hiprtcCreateProgram(), not filesystem includes
- Provide CHIP_RTC_CACHE_DIR="" to disable
- Include hipcc binary mtime or a version hash in the cache key
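The CHIP_RTC_CACHE_DIR policy could look roughly like the sketch below. The helper names are hypothetical, and falling back to $HOME for the default location is an assumption rather than existing chipStar behavior. The environment values are passed in as parameters so the policy can be exercised without modifying the process environment.

```cpp
#include <cstdlib>
#include <optional>
#include <string>

// Decides where the RTC cache lives.
//   envValue: value of CHIP_RTC_CACHE_DIR, or nullptr when unset
//   home:     value of HOME, or nullptr when unset
// Returns std::nullopt when caching should be disabled.
std::optional<std::string> resolveRtcCacheDir(const char *envValue,
                                              const char *home) {
  if (envValue != nullptr) {
    if (*envValue == '\0')
      return std::nullopt;  // CHIP_RTC_CACHE_DIR="" explicitly disables caching
    return std::string(envValue);
  }
  if (home == nullptr || *home == '\0')
    return std::nullopt;  // no sensible default location available
  return std::string(home) + "/.cache/chipStar/rtc";
}

// Thin wrapper that reads the real environment.
std::optional<std::string> rtcCacheDirFromEnvironment() {
  return resolveRtcCacheDir(std::getenv("CHIP_RTC_CACHE_DIR"),
                            std::getenv("HOME"));
}
```

Treating an empty value as "disabled" and an unset value as "use the default" keeps the escape hatch available to users who hit the filesystem header staleness problem.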
B. Preprocessor-based cache key
Run clang's preprocessor (-E) to expand all #include directives, then
hash the preprocessed output:
cache_key = hash(preprocessed_source + compile_options + compiler_version)
Pros:
- Captures all header content regardless of source (API headers, -I paths,
system headers)
- Fully correct: same guarantees as hashing the compilation output
Cons:
- Adds a second clang invocation (preprocessor pass) — significant latency
- The preprocessor output may itself be non-deterministic (e.g. __TIME__,
__COUNTER__ macros) though these are unlikely in GPU kernels
C. SPIR-V canonicalization before hashing
Normalize SPIR-V to a canonical form before computing the cache key:
strip the OpName/OpMemberName debug instructions, renumber IDs canonically.
Pros:
- Works at the existing cache layer, no architectural changes
- Correct — identical programs produce identical canonical form
Cons:
- Requires a SPIR-V parser that understands the full instruction set
- Must renumber ALL ID references (branches, type refs, etc.) — complex
- The LLVM non-determinism may extend beyond naming to instruction
ordering, which canonicalization cannot fix without semantic analysis
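To give a sense of the simpler half of this approach, here is a sketch that removes OpName and OpMemberName from a SPIR-V word stream. It relies on the standard binary layout (a five-word header followed by instructions whose first word packs the word count in the high 16 bits and the opcode in the low 16 bits). The function name is hypothetical, the module is assumed to already be in host byte order, and the sketch deliberately does not renumber IDs, which is the hard part noted above.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns a copy of `module` without OpName (opcode 5) and OpMemberName
// (opcode 6) instructions. Malformed input is returned unchanged.
std::vector<std::uint32_t>
stripDebugNames(const std::vector<std::uint32_t> &module) {
  constexpr std::uint32_t kMagic = 0x07230203u;
  constexpr std::size_t kHeaderWords = 5;
  constexpr std::uint32_t kOpName = 5;
  constexpr std::uint32_t kOpMemberName = 6;

  if (module.size() < kHeaderWords || module[0] != kMagic)
    return module;

  // Copy the header (magic, version, generator, ID bound, schema) verbatim.
  std::vector<std::uint32_t> out(module.begin(),
                                 module.begin() + kHeaderWords);

  std::size_t pos = kHeaderWords;
  while (pos < module.size()) {
    const std::uint32_t first = module[pos];
    const std::size_t wordCount = first >> 16;
    const std::uint32_t opcode = first & 0xFFFFu;
    if (wordCount == 0 || wordCount > module.size() - pos)
      return module;  // truncated or corrupt instruction stream
    if (opcode != kOpName && opcode != kOpMemberName)
      out.insert(out.end(), module.begin() + pos,
                 module.begin() + pos + wordCount);
    pos += wordCount;
  }
  return out;
}
```

Two modules that differ only in their name strings hash identically after this pass, but modules that differ in ID assignment or instruction order still do not.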
D. Hybrid: source-hash key + SPIR-V-hash validation
Use a source-based key for lookup but validate with SPIR-V hash:
primary_key = hash(source + headers + options)
validation = hash(spirv_bytes)
On cache hit, check if the SPIR-V validation hash matches. If not, the
cache entry is from a different compiler version — invalidate and
recompile.
Pros:
- Deterministic lookup (source-based)
- Detects compiler version changes automatically
- Handles the -I path staleness problem when the compiler is unchanged
but headers changed (SPIR-V hash would differ)
Cons:
- Still has the LLVM non-determinism problem — the validation hash may
not match even for identical source with an identical compiler, causing
false invalidation. Would need the SPIR-V canonicalization from
approach C to make the validation hash stable.
Recommendation
Approach A (RTC-level cache) is the most practical starting point. It
provides the largest speedup (skips the entire Clang subprocess) with the
simplest implementation. The filesystem header staleness issue is a
documented limitation that can be mitigated with a cache-clear mechanism.
The implementation would add ~50 lines to spirv_hiprtc.cc:
- In compile(), before invoking hipcc, compute the source-based cache key
- Check CHIP_RTC_CACHE_DIR (defaulting to ~/.cache/chipStar/rtc/)
- On hit: read cached bundle, skip hipcc invocation
- On miss: run hipcc as normal, write bundle to cache
This is independent of the existing module cache and would compose well
with it — the module cache can continue to operate on the (now cached)
SPIR-V for the SPIR-V→LLVM IR translation stage.
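The hit/miss steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the proposed patch: getOrCompileBundle, the ".bundle" file suffix, and the runHipcc callback are hypothetical, and the cache key is assumed to have been computed already. An empty cacheDir stands for "caching disabled".

```cpp
#include <filesystem>
#include <fstream>
#include <functional>
#include <ios>
#include <iterator>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Returns the offload bundle for `key`, calling `runHipcc` only on a miss.
std::string getOrCompileBundle(const fs::path &cacheDir, const std::string &key,
                               const std::function<std::string()> &runHipcc) {
  if (cacheDir.empty())
    return runHipcc();  // caching disabled

  const fs::path entry = cacheDir / (key + ".bundle");

  // Hit: read the cached bundle and skip the compiler entirely.
  {
    std::ifstream in(entry, std::ios::binary);
    if (in)
      return std::string(std::istreambuf_iterator<char>(in),
                         std::istreambuf_iterator<char>());
  }

  // Miss: compile as normal.
  std::string bundle = runHipcc();

  // Best-effort store. Write to a temporary file and rename it so that a
  // reader never observes a partially written entry. A fixed temporary name
  // is a simplification; concurrent writers would need unique names.
  std::error_code ec;
  fs::create_directories(cacheDir, ec);
  const fs::path tmp = cacheDir / (key + ".tmp");
  std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
  out.write(bundle.data(), static_cast<std::streamsize>(bundle.size()));
  out.close();
  if (!out) {
    fs::remove(tmp, ec);  // could not store; still return the fresh bundle
    return bundle;
  }
  fs::rename(tmp, entry, ec);
  if (ec)
    fs::remove(tmp, ec);
  return bundle;
}
```

Failures while writing the cache are swallowed on purpose, so a read-only or full cache directory degrades to the current behavior of compiling every time.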