Feature request: cross-run kernel caching for hipRTC #1143

## Problem

hipRTC-compiled kernels pay full compilation cost on every program run:
1. **Clang subprocess** (Stage 0): HIP C++ → SPIR-V (~3–5s per kernel)
2. **SPIR-V translation** (Stage 1): SPIR-V → LLVM IR via llvm-spirv
3. **Kernel codegen** (Stage 2+3): LLVM optimization + codegen + linking → kernel.so

chipStar's module cache (`~/.cache/chipStar/`) cannot help because the SPIR-V
is non-deterministic across runs due to LLVM's internal non-determinism
(hash table iteration order, value naming counters — see
https://github.com/llvm/llvm-project/issues/123791). The content-hash cache
key changes every run even for identical source.

A companion bug report (https://github.com/CHIP-SPV/chipStar/issues/1142)
suggests eliminating path-related non-determinism and disabling the
write-only cache to prevent unbounded disk growth. This feature request is
about making caching actually work for hipRTC.

## Approaches

### A. RTC-level cache in `spirv_hiprtc.cc` (recommended)

Cache the hipcc output (Clang offload bundle) keyed by the RTC inputs:

```
cache_key = hash(source_string + header_contents + compile_options
                 + hipcc_version + chipStar_build_id)
```
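
One possible way to assemble this key is sketched below. `RtcInputs`, `Fnv1a`, and `computeRtcCacheKey` are hypothetical names for illustration, not existing chipStar code. Each field is length-prefixed, because plain concatenation would let `("ab", "c")` and `("a", "bc")` produce the same byte stream. FNV-1a is used only to keep the sketch dependency-free; a real cache would probably want a wider, collision-resistant hash.

```cpp
#include <cstdint>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical bundle of everything that feeds the RTC compile.
struct RtcInputs {
  std::string Source;
  // (include name, contents) pairs as passed to hiprtcCreateProgram().
  std::vector<std::pair<std::string, std::string>> Headers;
  std::vector<std::string> Options;
  std::string HipccVersion;
  std::string BuildId;
};

// 64-bit FNV-1a; chosen here only because it needs no external library.
class Fnv1a {
public:
  void addBytes(std::string_view Bytes) {
    for (unsigned char C : Bytes) {
      State ^= C;
      State *= 0x100000001b3ULL;
    }
  }
  // Hash the field length first so field boundaries are unambiguous.
  void addField(std::string_view Field) {
    std::uint64_t Len = Field.size();
    char Buf[8];
    for (int I = 0; I < 8; ++I)
      Buf[I] = static_cast<char>((Len >> (8 * I)) & 0xff);
    addBytes(std::string_view(Buf, 8));
    addBytes(Field);
  }
  std::uint64_t value() const { return State; }

private:
  std::uint64_t State = 0xcbf29ce484222325ULL;
};

std::uint64_t computeRtcCacheKey(const RtcInputs &In) {
  Fnv1a H;
  H.addField(In.Source);
  H.addField(std::to_string(In.Headers.size()));
  for (const auto &[Name, Contents] : In.Headers) {
    H.addField(Name);
    H.addField(Contents);
  }
  H.addField(std::to_string(In.Options.size()));
  for (const auto &Opt : In.Options)
    H.addField(Opt);
  H.addField(In.HipccVersion);
  H.addField(In.BuildId);
  return H.value();
}
```

Note that option order is hashed as given; whether reordered but equivalent options should map to the same key is a design choice this sketch leaves open.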

**Pros:**
- Skips the entire Clang subprocess — the most expensive stage
- Clean architecture: sits entirely in the RTC layer
- Source + headers + options are all available in `compile()`
- Deterministic by definition (no LLVM output in the key)

**Cons:**
- Must include all headers passed via `hiprtcCreateProgram()` in the hash
- Does NOT capture changes to headers resolved from `-I` paths on the
  filesystem (the user may `#include "foo.h"` where `foo.h` comes from a
  `-I` path, not from `hiprtcCreateProgram`). If such a header changes, the
  cache would serve stale compiled code.
- Requires a cache invalidation strategy for compiler upgrades

**Mitigation for filesystem header staleness:**
- Document that the cache only tracks headers explicitly provided via
  `hiprtcCreateProgram()`, not filesystem includes
- Provide `CHIP_RTC_CACHE_DIR=""` to disable
- Include the hipcc binary mtime or a version hash in the cache key (this
  covers compiler upgrades rather than header changes)
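
The opt-out could be resolved along these lines. This is a sketch under assumptions: `resolveRtcCacheDir` and `rtcCacheDirFromEnvironment` are hypothetical helpers, and the fallback to `$HOME/.cache/chipStar/rtc` simply mirrors the default proposed in the Recommendation section. An unset variable selects the default, while an explicitly empty one disables the cache.

```cpp
#include <cstdlib>
#include <filesystem>
#include <optional>

// Pure helper so the policy can be tested without touching the environment.
// EnvValue and Home are nullptr when the corresponding variable is unset.
std::optional<std::filesystem::path>
resolveRtcCacheDir(const char *EnvValue, const char *Home) {
  if (EnvValue) {
    if (*EnvValue == '\0')
      return std::nullopt; // CHIP_RTC_CACHE_DIR="" disables the cache
    return std::filesystem::path(EnvValue);
  }
  if (!Home || *Home == '\0')
    return std::nullopt; // no sensible default location available
  return std::filesystem::path(Home) / ".cache" / "chipStar" / "rtc";
}

// Thin wrapper that reads the real environment.
std::optional<std::filesystem::path> rtcCacheDirFromEnvironment() {
  return resolveRtcCacheDir(std::getenv("CHIP_RTC_CACHE_DIR"),
                            std::getenv("HOME"));
}
```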

### B. Preprocessor-based cache key

Run clang's preprocessor (`-E`) to expand all `#include` directives, then
hash the preprocessed output:

```
cache_key = hash(preprocessed_source + compile_options + compiler_version)
```
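
One detail worth noting for this approach: by default `clang -E` emits linemarkers such as `# 3 "/abs/path/foo.h" 1`, so the preprocessed text depends on where the sources live. Clang's `-P` flag suppresses them; alternatively they can be filtered before hashing. The sketch below shows the filtering variant. `stripLineMarkers` is a hypothetical helper, and it assumes the GNU-style marker format.

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Removes GNU-style linemarkers ("# <digits> ...") and "#line" directives
// from preprocessed text, keeping every other line (including #pragma).
std::string stripLineMarkers(const std::string &Preprocessed) {
  std::istringstream In(Preprocessed);
  std::string Line;
  std::string Out;
  while (std::getline(In, Line)) {
    bool IsMarker = false;
    if (Line.rfind("#line", 0) == 0) {
      IsMarker = true;
    } else if (Line.size() >= 3 && Line[0] == '#' && Line[1] == ' ' &&
               std::isdigit(static_cast<unsigned char>(Line[2]))) {
      IsMarker = true;
    }
    if (!IsMarker) {
      Out += Line;
      Out += '\n';
    }
  }
  return Out;
}
```

With the markers removed, the same kernel preprocessed from two different checkout directories yields the same text, provided the headers themselves are identical.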

**Pros:**
- Captures all header content regardless of source (API headers, `-I` paths,
  system headers)
- Fully correct — same guarantees as hashing the compilation output

**Cons:**
- Adds a second clang invocation (preprocessor pass) — significant latency
- The preprocessor output may itself be non-deterministic (e.g. `__TIME__`,
  `__DATE__` macros), though these are unlikely in GPU kernels

### C. SPIR-V canonicalization before hashing

Normalize SPIR-V to a canonical form before computing the cache key:
strip `OpName`/`OpMemberName` debug instructions, renumber IDs canonically.

**Pros:**
- Works at the existing cache layer, no architectural changes
- Correct — identical programs produce identical canonical form

**Cons:**
- Requires a SPIR-V parser that understands the full instruction set
- Must renumber ALL ID references (branches, type refs, etc.) — complex
- The LLVM non-determinism may extend beyond naming to instruction
  ordering, which canonicalization cannot fix without semantic analysis
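
To give a sense of scale, the name-stripping half of this approach is small; the ID renumbering is where the complexity lies. The sketch below covers only the stripping step. It assumes the module is already loaded as 32-bit words in host byte order; `stripDebugNames` is a hypothetical helper. The opcode values (5 for `OpName`, 6 for `OpMemberName`) and the five-word header come from the SPIR-V specification.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint32_t SpirvMagic = 0x07230203u;
constexpr std::size_t SpirvHeaderWords = 5; // magic, version, generator, bound, schema
constexpr std::uint32_t OpNameOpcode = 5;
constexpr std::uint32_t OpMemberNameOpcode = 6;

// Returns a copy of the module without OpName/OpMemberName instructions.
// Returns an empty vector if the input is not a well-formed word stream.
std::vector<std::uint32_t>
stripDebugNames(const std::vector<std::uint32_t> &Module) {
  if (Module.size() < SpirvHeaderWords || Module[0] != SpirvMagic)
    return {};
  std::vector<std::uint32_t> Out(Module.begin(),
                                 Module.begin() + SpirvHeaderWords);
  std::size_t Pos = SpirvHeaderWords;
  while (Pos < Module.size()) {
    // First word of each instruction: high 16 bits = word count,
    // low 16 bits = opcode.
    std::uint32_t WordCount = Module[Pos] >> 16;
    std::uint32_t Opcode = Module[Pos] & 0xffffu;
    if (WordCount == 0 || WordCount > Module.size() - Pos)
      return {}; // truncated or corrupt instruction
    if (Opcode != OpNameOpcode && Opcode != OpMemberNameOpcode)
      Out.insert(Out.end(), Module.begin() + Pos,
                 Module.begin() + Pos + WordCount);
    Pos += WordCount;
  }
  return Out;
}
```

Two modules that differ only in value names hash identically after this pass, but any difference in ID assignment or instruction order still changes the hash.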

### D. Hybrid: source-hash key + SPIR-V-hash validation

Use a source-based key for lookup but validate with SPIR-V hash:

```
primary_key   = hash(source + headers + options)
validation    = hash(spirv_bytes)
```

On cache hit, check if the SPIR-V validation hash matches. If not, the
cache entry is from a different compiler version — invalidate and
recompile.

**Pros:**
- Deterministic lookup (source-based)
- Detects compiler version changes automatically
- Handles the `-I` path staleness problem when the compiler is unchanged
  but headers changed (SPIR-V hash would differ)

**Cons:**
- Still has the LLVM non-determinism problem — the validation hash may
  not match even for identical source with an identical compiler, causing
  false invalidation. Would need the SPIR-V canonicalization from
  approach C to make the validation hash stable.

## Recommendation

Approach A (RTC-level cache) is the most practical starting point. It
provides the largest speedup (skips the entire Clang subprocess) with the
simplest implementation. The filesystem header staleness issue is a
documented limitation that can be mitigated with a cache-clear mechanism.

The implementation would add ~50 lines to `spirv_hiprtc.cc`:
1. In `compile()`, before invoking hipcc, compute the source-based cache key
2. Check `CHIP_RTC_CACHE_DIR` (defaulting to `~/.cache/chipStar/rtc/`)
3. On hit: read cached bundle, skip hipcc invocation
4. On miss: run hipcc as normal, write bundle to cache

This is independent of the existing module cache and would compose well
with it — the module cache can continue to operate on the (now cached)
SPIR-V for the SPIR-V→LLVM IR translation stage.
