Feature request: cross-run kernel caching for hipRTC #1143

@Noerr

Problem

hipRTC-compiled kernels pay full compilation cost on every program run:

  1. Clang subprocess (Stage 0): HIP C++ → SPIR-V (~3–5s per kernel)
  2. SPIR-V translation (Stage 1): SPIR-V → LLVM IR via llvm-spirv
  3. Kernel codegen (Stage 2+3): LLVM optimization + codegen + linking → kernel.so

chipStar's module cache (~/.cache/chipStar/) cannot help because the SPIR-V
is non-deterministic across runs due to LLVM's internal non-determinism
(hash table iteration order, value naming counters — see
llvm/llvm-project#123791). The content-hash cache
key changes every run even for identical source.

A companion bug report, #1142, suggests eliminating path-related
non-determinism and disabling the cache, which is effectively write-only
(entries are written but never hit), to prevent unbounded disk growth. This
feature request is about actually making caching work for hipRTC.

Approaches

A. RTC-level cache in spirv_hiprtc.cc (recommended)

Cache the hipcc output (Clang offload bundle) keyed by the RTC inputs:

cache_key = hash(source_string + header_contents + compile_options
                 + hipcc_version + chipStar_build_id)
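
A minimal sketch of how such a key could be computed. Everything here is illustrative: RtcInputs, Fnv1a64 and rtc_cache_key are hypothetical names, not existing chipStar code, and FNV-1a is used only because it is short and stable across runs (a real implementation would probably want a stronger hash). Each field is length-prefixed, since plain concatenation would let ("ab", "c") and ("a", "bc") collide.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container for the inputs named in the key formula above.
struct RtcInputs {
  std::string source;
  // (name, contents) pairs as passed to hiprtcCreateProgram().
  std::vector<std::pair<std::string, std::string>> headers;
  std::vector<std::string> options;
  std::string hipcc_version;
  std::string chipstar_build_id;
};

// 64-bit FNV-1a. The result depends only on the bytes fed in.
class Fnv1a64 {
public:
  void bytes(const unsigned char *p, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
      state_ ^= p[i];
      state_ *= 0x100000001b3ULL;
    }
  }
  // Feed an integer in fixed little-endian byte order.
  void number(std::uint64_t v) {
    unsigned char buf[8];
    for (int i = 0; i < 8; ++i)
      buf[i] = static_cast<unsigned char>((v >> (8 * i)) & 0xffu);
    bytes(buf, sizeof buf);
  }
  // Length-prefix every string so field boundaries are part of the hash.
  void field(const std::string &s) {
    number(s.size());
    bytes(reinterpret_cast<const unsigned char *>(s.data()), s.size());
  }
  std::uint64_t value() const { return state_; }

private:
  std::uint64_t state_ = 0xcbf29ce484222325ULL;
};

std::uint64_t rtc_cache_key(const RtcInputs &in) {
  Fnv1a64 h;
  h.field(in.source);
  h.number(in.headers.size());
  for (const auto &hdr : in.headers) {
    h.field(hdr.first);
    h.field(hdr.second);
  }
  h.number(in.options.size());
  for (const auto &opt : in.options)
    h.field(opt);
  h.field(in.hipcc_version);
  h.field(in.chipstar_build_id);
  return h.value();
}
```

The key depends on option order, so "-O2 -DX" and "-DX -O2" would get separate entries. That seems the safer default, since reordering some options can change the output.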

Pros:

  • Skips the entire Clang subprocess — the most expensive stage
  • Clean architecture: sits entirely in the RTC layer
  • Source + headers + options are all available in compile()
  • Deterministic by definition (no LLVM output in the key)

Cons:

  • Must include all headers passed via hiprtcCreateProgram() in the hash
  • Does NOT capture changes to headers resolved from -I paths on the
    filesystem (the user may #include "foo.h" where foo.h comes from a
    -I path, not from hiprtcCreateProgram). If such a header changes, the
    cache would serve stale compiled code.
  • Requires a cache invalidation strategy for compiler upgrades

Mitigation for filesystem header staleness:

  • Document that the cache only tracks headers explicitly provided via
    hiprtcCreateProgram(), not filesystem includes
  • Provide CHIP_RTC_CACHE_DIR="" to disable the cache
  • Include hipcc binary mtime or a version hash in the cache key
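
As a sketch of the disable switch, the environment variable could be resolved roughly as below (C++17). The semantics are an assumption, not settled behaviour: unset means "use the default directory", an empty string means "caching disabled". resolve_rtc_cache_dir is a hypothetical helper. It takes the two environment values as parameters so the logic can be tested without touching the process environment.

```cpp
#include <cstdlib>
#include <filesystem>
#include <optional>

// Returns the cache directory, or std::nullopt when caching is disabled.
// Assumed semantics: cache_env unset -> default, cache_env empty -> disabled.
std::optional<std::filesystem::path>
resolve_rtc_cache_dir(const char *cache_env, const char *home_env) {
  if (cache_env != nullptr) {
    if (*cache_env == '\0')
      return std::nullopt; // CHIP_RTC_CACHE_DIR="" turns the cache off
    return std::filesystem::path(cache_env);
  }
  if (home_env == nullptr || *home_env == '\0')
    return std::nullopt; // no home directory, so no default location
  return std::filesystem::path(home_env) / ".cache" / "chipStar" / "rtc";
}

// Convenience wrapper that reads the real environment.
std::optional<std::filesystem::path> resolve_rtc_cache_dir_from_env() {
  return resolve_rtc_cache_dir(std::getenv("CHIP_RTC_CACHE_DIR"),
                               std::getenv("HOME"));
}
```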

B. Preprocessor-based cache key

Run clang's preprocessor (-E) to expand all #include directives, then
hash the preprocessed output:

cache_key = hash(preprocessed_source + compile_options + compiler_version)
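
For illustration, the extra preprocessor pass could be assembled like this. preprocess_argv is a hypothetical helper; -E (preprocess only) and -P (omit linemarkers) are standard Clang driver flags, but whether hipcc forwards them unchanged, and how offload compilation affects the output, is not verified here.

```cpp
#include <string>
#include <vector>

// Builds the argument vector for a preprocess-only compiler run.
// `compiler` and `source_path` are placeholders supplied by the caller.
std::vector<std::string>
preprocess_argv(const std::string &compiler,
                const std::vector<std::string> &options,
                const std::string &source_path) {
  // -E stops after preprocessing; -P drops "# <line> <file>" markers so that
  // file paths do not leak into the text that gets hashed.
  std::vector<std::string> argv{compiler, "-E", "-P"};
  // Pass the user's options through so -I and -D affect the expansion.
  argv.insert(argv.end(), options.begin(), options.end());
  argv.push_back(source_path);
  return argv;
}
```

The preprocessed text arrives on the compiler's standard output and would then be hashed together with the options and compiler version.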

Pros:

  • Captures all header content regardless of source (API headers, -I paths,
    system headers)
  • Correct with respect to source content: any change to an included file
    changes the key

Cons:

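  • Adds a second clang invocation (preprocessor pass), which costs
    significant latency on every compile, including cache hits
  • The preprocessor output may itself be non-deterministic (e.g. the
    __DATE__, __TIME__ and __TIMESTAMP__ macros, or file paths in linemarkers
    unless -P is used), though these macros are unlikely in GPU kernels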

C. SPIR-V canonicalization before hashing

Normalize SPIR-V to a canonical form before computing the cache key:
strip OpName/OpMemberName debug instructions, renumber IDs canonically.
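
The stripping half is the easy part and can be sketched as below. It relies only on the SPIR-V binary layout: a five-word header, then instructions whose first word holds the word count in the high 16 bits and the opcode in the low 16 bits (OpName is opcode 5, OpMemberName is opcode 6). strip_debug_names is a hypothetical helper. It does not renumber IDs, which is the hard part.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns a copy of a SPIR-V module with OpName and OpMemberName removed.
// Malformed input (bad magic, truncated instruction) is returned unchanged.
std::vector<std::uint32_t>
strip_debug_names(const std::vector<std::uint32_t> &words) {
  constexpr std::size_t kHeaderWords = 5;
  constexpr std::uint32_t kMagic = 0x07230203u;
  constexpr std::uint32_t kOpName = 5;
  constexpr std::uint32_t kOpMemberName = 6;

  if (words.size() < kHeaderWords || words[0] != kMagic)
    return words;

  // Copy the header (magic, version, generator, bound, schema) verbatim.
  std::vector<std::uint32_t> out(words.data(), words.data() + kHeaderWords);

  std::size_t i = kHeaderWords;
  while (i < words.size()) {
    const std::uint32_t word_count = words[i] >> 16;
    const std::uint32_t opcode = words[i] & 0xFFFFu;
    if (word_count == 0 || i + word_count > words.size())
      return words; // truncated or corrupt instruction stream
    if (opcode != kOpName && opcode != kOpMemberName)
      out.insert(out.end(), words.data() + i, words.data() + i + word_count);
    i += word_count;
  }
  return out;
}
```

Names carry no semantics, so dropping them is safe for hashing purposes, but the result only becomes a stable key once the remaining IDs are also renumbered.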

Pros:

  • Works at the existing cache layer, no architectural changes
  • Correct — identical programs produce identical canonical form

Cons:

  • Requires a SPIR-V parser that understands the full instruction set
  • Must renumber ALL ID references (branches, type refs, etc.) — complex
  • The LLVM non-determinism may extend beyond naming to instruction
    ordering, which canonicalization cannot fix without semantic analysis

D. Hybrid: source-hash key + SPIR-V-hash validation

Use a source-based key for lookup, but validate each hit against a SPIR-V hash:

primary_key   = hash(source + headers + options)
validation    = hash(spirv_bytes)

On cache hit, check if the SPIR-V validation hash matches. If not, the
cache entry is from a different compiler version — invalidate and
recompile.
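
One possible reading of this scheme, as an in-memory sketch. HybridCache, Entry and Lookup are hypothetical names. The sketch assumes a SPIR-V hash for the current inputs is available at lookup time. The proposal does not say where that hash comes from.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

enum class Lookup { Miss, Hit, Stale };

// One cached bundle plus the hash of the SPIR-V it was built from.
struct Entry {
  std::uint64_t spirv_hash;
  std::string bundle;
};

class HybridCache {
public:
  void store(std::uint64_t primary_key, std::uint64_t spirv_hash,
             std::string bundle) {
    entries_[primary_key] = Entry{spirv_hash, std::move(bundle)};
  }

  // Looks up by the source-based key, then validates against a SPIR-V hash
  // computed in the current run. A mismatch evicts the entry.
  Lookup check(std::uint64_t primary_key, std::uint64_t fresh_spirv_hash) {
    auto it = entries_.find(primary_key);
    if (it == entries_.end())
      return Lookup::Miss;
    if (it->second.spirv_hash != fresh_spirv_hash) {
      entries_.erase(it);
      return Lookup::Stale;
    }
    return Lookup::Hit;
  }

private:
  std::unordered_map<std::uint64_t, Entry> entries_;
};
```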

Pros:

  • Deterministic lookup (source-based)
  • Detects compiler version changes automatically
  • Handles the -I path staleness problem when the compiler is unchanged
    but headers changed (SPIR-V hash would differ)

Cons:

  • Still has the LLVM non-determinism problem — the validation hash may
    not match even for identical source with an identical compiler, causing
    false invalidation. Would need the SPIR-V canonicalization from
    approach C to make the validation hash stable.

Recommendation

Approach A (RTC-level cache) is the most practical starting point. It
provides the largest speedup (skips the entire Clang subprocess) with the
simplest implementation. The filesystem header staleness issue is a
documented limitation that can be mitigated with a cache-clear mechanism.

The implementation would add ~50 lines to spirv_hiprtc.cc:

  1. In compile(), before invoking hipcc, compute the source-based cache key
  2. Check CHIP_RTC_CACHE_DIR (defaulting to ~/.cache/chipStar/rtc/)
  3. On hit: read cached bundle, skip hipcc invocation
  4. On miss: run hipcc as normal, write bundle to cache
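
The four steps could look roughly like this (C++17). This is a sketch under assumptions, not the actual spirv_hiprtc.cc code: compile_with_cache is a hypothetical helper, the compile callback stands in for the existing hipcc invocation, and the ".bundle" file naming is made up for illustration.

```cpp
#include <filesystem>
#include <fstream>
#include <functional>
#include <iterator>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Returns the compiled bundle for `key`, using `cache_dir` as a file cache.
// `compile` is only called on a miss.
std::string compile_with_cache(const fs::path &cache_dir,
                               const std::string &key,
                               const std::function<std::string()> &compile) {
  const fs::path entry = cache_dir / (key + ".bundle");

  // Hit: read the cached bundle and skip the compiler entirely.
  {
    std::ifstream in(entry, std::ios::binary);
    if (in)
      return std::string(std::istreambuf_iterator<char>(in),
                         std::istreambuf_iterator<char>());
  }

  // Miss: compile as normal.
  std::string bundle = compile();

  // Best-effort store. Write to a temporary file and rename it into place so
  // a concurrent reader never sees a partially written entry. Failures are
  // ignored because the cache is only an optimization.
  std::error_code ec;
  fs::create_directories(cache_dir, ec);
  if (!ec) {
    fs::path tmp = entry;
    tmp += ".tmp";
    bool ok = false;
    {
      std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
      out.write(bundle.data(), static_cast<std::streamsize>(bundle.size()));
      out.flush();
      ok = static_cast<bool>(out);
    }
    if (ok)
      fs::rename(tmp, entry, ec);
    if (!ok || ec)
      fs::remove(tmp, ec);
  }
  return bundle;
}
```

A real implementation would want a unique temporary file name per process. The fixed ".tmp" suffix here can collide when two processes compile the same kernel at once.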

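This is independent of the existing module cache and would compose well
with it: a bundle served from the RTC cache is byte-identical across runs,
so the SPIR-V extracted from it produces the same content hash every time,
and the module cache can then operate on that (now cached) SPIR-V for the
SPIR-V→LLVM IR translation stage.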
