[Feature] Add MESH expert residency with io_uring direct I/O for MoE inference#2003

Open
RaQiu wants to merge 57 commits into
kvcache-ai:mainfrom
RaQiu:main

RaQiu commented May 13, 2026

What does this PR do?

This PR introduces MESH, an experimental memory-tiered expert residency system for KTransformers MoE inference.

MESH is designed for heterogeneous CPU-GPU MoE serving when the full expert working set does not fit comfortably in host DRAM. The current KT path works well when expert weights are served from memory-backed storage, but constrained-memory deployments expose a hard systems problem: only a sparse subset of experts is active at any point at runtime, yet the full expert set must remain accessible. Relying on mmap alone leaves expert residency mostly under the control of the OS page cache.

MESH adds an explicit runtime-managed expert residency layer for AMXINT4 MoE inference. It can load expert weights from NVMe through io_uring + O_DIRECT into NUMA-local CPU buffers, manage a bounded resident expert slot pool, and preserve the existing KT AMX compute path once the required experts are resident.

The goal is not to replace KT's AMX kernels. The goal is to make expert residency explicit: which experts are in CPU memory, which experts are cold, when cold experts are read, and how prefill/decode should share the resident slot pool.

Why this is useful

On constrained-memory machines, mmap makes MoE expert residency hard to control:

  • The OS page cache can hold a hidden copy of expert weights.
  • NUMA-local CPU execution may require a second application-managed copy.
  • Page-cache reclaim timing is not visible to the runtime.
  • Page faults happen synchronously on the critical path.
  • The runtime cannot directly express that some experts should remain hot while others can be evicted.

MESH gives KTransformers a runtime-level mechanism to manage this explicitly:

  • expert weights can be read directly into NUMA-local buffers,
  • resident experts are tracked by layer and expert id,
  • cold expert reads are issued through async io_uring,
  • promotion and demotion are visible to the runtime,
  • GPU-resident experts remain outside CPU resident-slot management,
  • existing AMX compute kernels are reused after residency is resolved.

This is useful for local single-node MoE serving, workstation deployments, and memory-constrained CPU-GPU systems where CPU AMX, GPU attention, NUMA locality, and NVMe-backed expert storage need to work together.

Main components

1. io_uring direct-I/O expert loading

MESH adds an io_uring-based loading path for expert weights. With O_DIRECT, expert tensors can be read from storage directly into application-owned buffers, avoiding the OS page cache on the MESH path.

The async reader tracks request completion explicitly and validates every read result. Failed or incomplete (short) reads go through retry/validation logic rather than being treated as successful.
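The two checks described above can be sketched in a few lines. This is a minimal illustration, not the PR's `AsyncExpertReader`: the helper names, the 4096-byte block size, and the choice of which errno values count as retryable are all assumptions. It relies only on two well-known facts: `O_DIRECT` reads generally need block-aligned offsets and lengths, and an io_uring completion result is either a negative errno or the number of bytes transferred.

```python
import errno

BLOCK = 4096  # assumed logical block size; real devices may differ


def align_direct_read(offset: int, length: int, block: int = BLOCK):
    """Widen (offset, length) to block boundaries for an O_DIRECT read.

    Returns (aligned_offset, aligned_length, skip), where skip is the number
    of leading bytes in the aligned buffer that precede the requested data.
    """
    start = (offset // block) * block              # round the offset down
    end = -(-(offset + length) // block) * block   # round the end up
    return start, end - start, offset - start


def classify_completion(res: int, expected: int) -> str:
    """Interpret an io_uring-style result for a read of `expected` bytes."""
    if res < 0:
        # Negative values are -errno. Treating EAGAIN/EINTR as retryable
        # is an assumption of this sketch.
        return "retry" if -res in (errno.EAGAIN, errno.EINTR) else "error"
    if res < expected:
        return "short_read"  # completed, but not all bytes arrived
    return "ok"
```

A caller would submit the aligned range, then accept the buffer only when `classify_completion` returns `"ok"`, which is the property the short-read unit test in this PR is described as checking.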

2. NUMA-local resident expert slots

MESH adds a resident slot pool for CPU-managed experts. Each resident expert is associated with slot-owned buffers for its expert tensors. The slot metadata tracks expert state, slot state, and active readers so that a resident expert is not evicted while it is being used by the AMX forward path.

The slot pool is virtual: a slot is not permanently tied to a fixed expert id. It can be rebound to different experts as promotion and eviction happen.
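To make the "virtual slot" idea concrete, here is a small single-threaded sketch. The class and method names (`SlotPool`, `bind`, `acquire`, `release`) are hypothetical and do not come from the PR; the real implementation also tracks expert and slot state and owns NUMA-local buffers, which this sketch omits.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

ExpertKey = Tuple[int, int]  # (layer, expert_id)


@dataclass
class Slot:
    owner: Optional[ExpertKey] = None  # expert currently bound to this slot
    readers: int = 0                   # forward passes currently using it


class SlotPool:
    """Hypothetical pool: slots are rebound to experts, never owned by one."""

    def __init__(self, num_slots: int) -> None:
        self.slots: List[Slot] = [Slot() for _ in range(num_slots)]
        self.index: Dict[ExpertKey, int] = {}  # resident expert -> slot number

    def bind(self, key: ExpertKey) -> Optional[int]:
        """Make key resident; return its slot, or None if all slots are pinned."""
        if key in self.index:
            return self.index[key]
        empty = [i for i, s in enumerate(self.slots) if s.owner is None]
        idle = [i for i, s in enumerate(self.slots)
                if s.owner is not None and s.readers == 0]
        for i in empty + idle:  # prefer empty slots, then evict an idle one
            slot = self.slots[i]
            if slot.owner is not None:
                del self.index[slot.owner]  # demote the previous owner
            slot.owner = key
            self.index[key] = i
            return i
        return None

    def acquire(self, key: ExpertKey) -> int:
        """Pin a resident expert for the duration of a forward pass."""
        i = self.index[key]
        self.slots[i].readers += 1
        return i

    def release(self, key: ExpertKey) -> None:
        """Unpin; the slot becomes evictable once readers drops to zero."""
        self.slots[self.index[key]].readers -= 1
```

The reader count is what prevents eviction during use: `bind` only considers slots whose `readers` is zero, so a pinned expert cannot be rebound underneath the compute path. A concurrent version would need a lock or atomics around these updates.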

3. Batched cold expert promotion

For a layer forward, MESH can collect the CPU experts needed by that forward pass, identify which ones are already resident, submit reads for the cold ones, wait for those reads to complete, bind the completed buffers into slots, and then call the existing KT AMX Base::forward() path.

This keeps the compute path close to KT's original implementation while moving expert residency decisions into an explicit runtime layer.
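The per-layer sequence described above (collect, split into resident and cold, read, bind, compute) can be sketched as follows. Everything here is illustrative: `forward_with_promotion`, `read_cold`, and `compute` are hypothetical stand-ins, `read_cold` represents "submit all reads and wait for them", and eviction is deliberately left out.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def forward_with_promotion(
    needed: Sequence[int],
    resident: Dict[int, bytes],
    read_cold: Callable[[List[int]], Dict[int, bytes]],
    compute: Callable[[List[bytes]], object],
) -> Tuple[object, List[int], List[int]]:
    """Resolve residency for one layer forward, then run the compute step."""
    order = list(dict.fromkeys(needed))           # dedupe, keep routing order
    hits = [e for e in order if e in resident]    # already in CPU memory
    cold = [e for e in order if e not in resident]
    if cold:
        loaded = read_cold(cold)                  # one batch: submit, then wait
        missing = [e for e in cold if e not in loaded]
        if missing:
            raise IOError(f"experts not loaded: {missing}")
        resident.update(loaded)                   # bind completed buffers
    # Compute runs only after every required expert is resident.
    return compute([resident[e] for e in order]), hits, cold
```

Issuing the cold reads as one batch, rather than one blocking read per expert, is what lets the storage device overlap the requests before the compute step starts.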

4. Cache policy and heat-aware residency

MESH keeps a bounded resident set and uses a policy-driven eviction path. The current implementation supports a SIEVE-style base policy and a heat/lookahead signal derived from router scores. The intent is to retain experts that are likely to be reused while allowing cold or low-value experts to be demoted.

The implementation also supports full-gate score observation and skips unnecessary observation when the effective resident capacity already covers all CPU-managed experts.
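For readers unfamiliar with SIEVE, the sketch below shows the base eviction rule in isolation: a FIFO order, one "visited" bit per entry, and a hand that sweeps from oldest to newest, clearing bits until it finds an unvisited victim. The class name `SieveCache` is hypothetical, and the heat/lookahead signal and pinned-slot handling described in this PR are not modeled.

```python
from typing import Dict, Hashable, List, Optional


class SieveCache:
    """Minimal SIEVE-style resident set keyed by expert id (illustrative)."""

    def __init__(self, capacity: int) -> None:
        assert capacity > 0
        self.capacity = capacity
        self.order: List[Hashable] = []          # index 0 is the oldest entry
        self.visited: Dict[Hashable, bool] = {}  # one reuse bit per entry
        self.hand = 0                            # sweep position in `order`

    def _evict(self) -> Hashable:
        # Sweep toward newer entries, giving visited entries a second chance.
        while self.visited[self.order[self.hand]]:
            self.visited[self.order[self.hand]] = False
            self.hand = (self.hand + 1) % len(self.order)
        victim = self.order.pop(self.hand)
        del self.visited[victim]
        if self.hand >= len(self.order):
            self.hand = 0                        # wrap back to the oldest entry
        return victim

    def access(self, key: Hashable) -> Optional[Hashable]:
        """Record an access; return the evicted key if one was demoted."""
        if key in self.visited:
            self.visited[key] = True             # hit: mark, do not reorder
            return None
        victim = self._evict() if len(self.order) >= self.capacity else None
        self.order.append(key)                   # new entries join as newest
        self.visited[key] = False
        return victim
```

Unlike LRU, a hit only sets a bit and never moves the entry, which keeps the hit path cheap. A heat-aware variant could, for example, set the bit from router scores as well as from actual accesses, but that is a guess at the design rather than a description of it.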

5. Deferred-expert aware decode behavior

MESH integrates with KT's deferred expert execution. Decode can distinguish resident experts from cold experts and issue prefetches for cold deferred experts. The accounting separates normal hits, cold misses, and in-flight misses so that async prefetch behavior is not mistaken for a fully cold miss.
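The three-way accounting can be illustrated with a small classifier. The names `Lookup`, `classify`, and `account` are hypothetical; the counter names in the PR's C++ `ExpertCacheStats` differ, and this sketch only shows why the in-flight case needs its own bucket.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, Set


class Lookup(Enum):
    HIT = "hit"                        # expert already resident
    IN_FLIGHT_MISS = "in_flight_miss"  # not resident, but a read is pending
    COLD_MISS = "cold_miss"            # not resident and no read issued yet


def classify(expert: int, resident: Set[int], in_flight: Set[int]) -> Lookup:
    """Classify one expert lookup against the current residency state."""
    if expert in resident:
        return Lookup.HIT
    if expert in in_flight:
        return Lookup.IN_FLIGHT_MISS
    return Lookup.COLD_MISS


def account(experts: Iterable[int], resident: Set[int],
            in_flight: Set[int]) -> Counter:
    """Tally lookups for a batch of routed experts."""
    return Counter(classify(e, resident, in_flight) for e in experts)
```

Keeping in-flight misses separate means a prefetch that was issued but has not yet completed is not counted as if no prefetch had happened, which would understate how well the prefetcher is predicting.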

6. Prefill residency support

MESH includes prefill-specific residency support. The current branch contains a prefill layer-window mode and transition logic back into decode hot-cache mode. This allows prefill and decode to interpret the resident slot pool differently while still sharing the same underlying slot abstraction.

The implementation also supports configurable early-layer residency through KT_MESH_EARLY_LAYER_EXPERTS, because early MoE layers can have different miss behavior from deeper layers.

7. GPU expert compatibility

MESH does not treat GPU experts as CPU resident-slot candidates. CPU-side residency logic checks the actual per-layer GPU expert mask and skips experts that are already assigned to GPU execution.

This matters because GPU experts are not necessarily a fixed prefix of expert ids and may vary by layer or under dynamic expert placement.
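A short sketch shows why a per-layer mask is needed instead of a prefix test such as `expert_id < num_gpu_experts`. The function name and the mask layout (one boolean per expert id, per layer) are assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def cpu_candidates(routed: Sequence[int], gpu_mask: Sequence[bool]) -> List[int]:
    """Return the routed experts that CPU residency logic should manage."""
    return [e for e in routed if not gpu_mask[e]]


# Hypothetical masks: GPU experts are neither a prefix nor the same per layer.
gpu_masks: Dict[int, List[bool]] = {
    0: [True, False, True, False],   # layer 0: experts 0 and 2 run on GPU
    1: [False, False, False, True],  # layer 1: only expert 3 runs on GPU
}
```

With these masks, the same routed set `[0, 1, 2, 3]` yields different CPU candidates in layer 0 and layer 1, so a fixed-prefix assumption would either waste CPU slots on GPU experts or skip experts the CPU actually has to serve.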

Compatibility

MESH is intended to be opt-in through the IOURING backend. The existing mmap/default path should continue to behave as before.

The implementation is designed to preserve KT's existing execution assumptions:

  • AMX compute kernels are not replaced.
  • The quantized expert layout is preserved.
  • Existing CPU/GPU expert separation is respected.
  • Existing KT forward computation is reused after expert residency is resolved.
  • Non-IOURING paths should not depend on MESH residency state.

Current validation status

This work has been validated at two levels.

Automated/unit-level tests

The branch includes unit-level coverage for the async I/O layer, including:

  • basic async reads,
  • batch reads,
  • multiple requests belonging to the same expert,
  • timeout behavior,
  • IO backend enum exposure,
  • Python-to-C++ config conversion,
  • short-read rejection.

The short-read test verifies that a completed-but-incomplete read is not treated as a successful request.

Manual/system-level validation

MESH has also been tested manually on a dual-socket AMX server with Qwen3.5-35B AMXINT4 weights, NVMe-backed expert storage, NUMA execution, and SGLang/KTransformers serving.

The manual validation included:

  • IOURING backend startup and model serving.
  • io_uring direct-read behavior under AMXINT4 expert loading.
  • Deferred-expert decode behavior.
  • hit / cold-miss / in-flight-miss accounting.
  • per-layer and per-token expert hit-rate analysis.
  • prefill behavior under multiple scheduling variants.
  • early-layer full-residency configuration.
  • GPU-expert mask compatibility with CPU resident-slot management.
  • prefill timing and storage pressure analysis with iostat.
  • server-side smoke tests for current MESH code paths.

Some of these experiments are saved under local paper/experiment artifact directories and are not suitable as normal CI tests.

The full MESH prefill/decode system test is not currently included as an automated per-commit test because it is hardware- and model-dependent: it requires AMX-capable CPUs, the AMXINT4 expert weight layout, NVMe-backed weights, NUMA configuration, and a running SGLang/KTransformers server.

This PR should be reviewed as an experimental systems path rather than a fully production-hardened default backend.

Fixes # (issue)

Before submitting

Note: automated tests exist for the async I/O layer. MESH has also been manually validated on the target AMX/NVMe/model-weight infrastructure, but those full system tests are not included in per-commit CI.

pigeonsoup and others added 30 commits February 12, 2026 15:15
- Add platform-conditional triton dependencies (PEP 508 markers):
  triton on Linux/macOS, triton-windows on Windows
- Update kt-kernel/pyproject.toml, kt-kernel/requirements.txt,
  kt-sft/pyproject.toml, and kt-sft/setup.py
- Add Windows OS classifier to kt-kernel and kt-sft

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
On Windows, `sh` is not available, causing the git hooks install
script to fail. Change FATAL_ERROR to WARNING so the build continues.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use file(COPY) and file(GLOB) instead of calling `sh` to install
git hooks, making it work on both Windows and Unix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add install-git-hooks.bat as Windows equivalent of .sh version
- CMakeLists.txt: use bat on WIN32, sh on Unix

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
On Windows, pkg-config and hwloc are hard to get. CMake now
auto-downloads the official pre-built hwloc binaries and creates
an IMPORTED target, so users don't need to install anything manually.
Linux behavior unchanged (still uses pkg-config).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Windows provides NUMA support via kernel32.dll, so libnuma
(Linux-only) is not needed. Skip the find_library check on WIN32.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- task_queue.cpp: guard unused pthread.h/sched.h includes
- worker_pool.h: guard numa.h, use hwloc fallback for set_to_numa on Win
- worker_pool.cpp: add Windows compat shims (sched_getcpu,
  numa_node_of_cpu, numa_num_configured_nodes), guard pthread_setname_np
- shared_mem_buffer.cpp: use _aligned_malloc/_aligned_free on Windows,
  add Windows compat shims for NUMA functions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add NOMINMAX compile definition on Windows to prevent windows.h
  min/max macros from conflicting with std::min/std::max (fixes C2589)
- Wrap numa.h/numaif.h includes in moe.hpp with #ifndef _WIN32

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The extension loader was hardcoded to look for .so files (Linux), but
on Windows pybind11 produces .pyd files. Now tries both suffixes on
Windows.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…etup.py

- Set per-config CMAKE_LIBRARY_OUTPUT_DIRECTORY_RELEASE etc. for MSVC
  multi-config generators that put output in Release/ subdirectory
- Replace hardcoded .so with platform-aware suffix (.pyd on Windows)
- Fix multi-variant rename and rglob to handle both .so and .pyd

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The doctor command had hardcoded .so globs for finding kernel extensions.
On Windows, Python extensions use .pyd suffix instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nagement

- Dashboard connects to configured SGLang server showing model info,
  VRAM usage breakdown, throughput metrics, and server configuration
- Chat page with conversation history persistence, connection bar for
  quick URL/model switching, and sidebar for managing multiple chats
- Server management with start/stop controls, log viewer, and diagnostics
- Model management with scanning, verification, and HuggingFace download
- Configuration page with server, paths, inference, and advanced settings
- Fix config save: unwrap Vue reactive proxies before Electron IPC calls

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…romotion

Implement Plan B tiered weight caching for MoE experts:
- Tier 0: NUMA-local malloc buffers (~80ns, hot experts promoted via background thread)
- Tier 1: mmap pages resident in RAM (~150ns, OS-managed page cache)
- Tier 2: mmap pages on disk (~100us, cold experts evicted under memory pressure)

C++ (llamafile/moe.hpp):
- Replace flat weight buffers with per-expert pointer vectors (gate/up/down_expert_ptrs_)
- Add baseline pointer arrays for demotion fallback
- load_weights() supports both mmap zero-copy and legacy memcpy modes via use_mmap flag
- promote_expert(): numa_alloc_onnode + memcpy + atomic pointer swap
- demote_expert(): restore baseline pointers + numa_free
- TP_MOE<LLAMA_MOE_TP> forwards promote/demote to all TP instances
- Proper destructor frees NUMA Tier 0 buffers and storage

C++ (amx/moe.hpp):
- Add mmap zero-copy path in load_weights (std::free original buffers, point to mmap)

C++ (ext_bindings.cpp):
- Expose promote_expert/demote_expert/is_expert_promoted via pybind11 (LLAMA_MOE_TP only)
- Expose use_mmap field on GeneralMOEConfig

Python (weight_provider.py - new):
- TieredWeightProvider: orchestrates promote/demote via C++ MOE objects
- ExpertHotnessTracker: EMA-based activation frequency with vectorized np.add.at()
- MmapWeightRegion: zero-copy mmap views with madvise(MADV_WILLNEED) prefetch
- Background promotion thread with lazy start (only when first MOE registered)

Python (experts_base.py):
- Add _tiered_provider singleton and _prev_topk_ids class variables
- submit_forward(): prefetch using previous token's expert IDs (no GPU sync)
- sync_forward(): record activations after CPU sync (not in hot path)
- _register_moe_with_provider(): connect subclass MOE objects to tiered provider

Python (llamafile.py):
- load_weights() dual-mode: tiered (mmap zero-copy) vs legacy (memcpy)
- Register per-expert mmap regions with TieredWeightProvider for prefetch
- Register MOE with provider to enable promote/demote

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- CLI: add --weight-strategy (auto/tiered/legacy) to `kt run` command
- CLI: pass --kt-weight-strategy to sglang engine launch
- CLI: show weight strategy in interactive preview and tuna engine
- Frontend: add Memory-Mapped Weights toggle in server config dialog
- Frontend: pass --weight-strategy flag to kt process spawn
- Add autodl setup script for DeepSeek V3.2

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three fixes for NUMA locality of mmap'd weight pages:

1. mbind in load_weights(): After setting per-expert mmap pointers, call
   mbind(MPOL_BIND | MPOL_MF_MOVE) to bind/migrate pages to the TP
   instance's NUMA node. Without this, pages land on whichever node
   first faults them (typically Python's main thread on node 0).

2. madvise in promote_expert(): Issue MADV_WILLNEED on baseline mmap
   regions before memcpy to trigger async readahead. Reduces synchronous
   page fault stalls (~100us each) during promotion.

3. NUMA-aware promote dispatch: TP_MOE::promote_expert() now spawns
   per-NUMA threads (via set_to_numa) so memcpy executes on the correct
   NUMA node. Previously ran on Python's unbound promotion thread,
   causing cross-NUMA traffic for both source reads and destination writes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add KT_TIER0_MEMORY_GB and KT_MAX_TIER0_EXPERTS env vars
- Fallback mechanism: CLI params not passed to SGLang directly
- BaseMoEWrapper reads env vars when parameters are None
- Avoids modifying third-party SGLang code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove --kt-weight-strategy, --kt-tier0-memory-gb, --kt-max-tier0-experts from SGLang command
- SGLang doesn't recognize these parameters (would cause startup failure)
- All three parameters now passed via environment variables only
- Add KT_WEIGHT_STRATEGY env var support in BaseMoEWrapper

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use psutil to detect available system memory
- Default: all available memory minus 4GB safety margin
- Optimized for dedicated inference machines
- User can override with --tier0-memory-gb or --max-tier0-experts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix C++ compilation error in moe.hpp (typename decltype -> decltype)
- Fix sync_forward signature for sglang integration compatibility
- Add tier0-memory-gb and max-tier0-experts CLI parameters
- Integrate tier0 configuration across MoE kernel modules
…X/moe_kernel backends

- Add promote_expert/demote_expert/is_expert_promoted hooks to AMX TP and moe_kernel C++ classes
- Fix BF16 cache_capacity_=0 when max_tier0_experts<=0 (was incorrectly defaulting to 1)
- Wire max_tier0_experts into AMX moe_config
- Add mmap region registration for AMX and moe_kernel backends
- Add cgroup v1/v2-aware memory detection in weight_provider.py
- Add AMXINT4/AMXINT8/MOE_INT4/MOE_INT8 to _provider_backends in experts_base.py
- Fix pre-commit hook shebang and mapfile for macOS bash 3.2
- All 11 unit tests passing
- Merged upstream changes including GPTQ expert loading support
- Resolved conflict in loader.py: kept both our mmap methods and upstream GPTQSafeTensorLoader.load_experts()
Sprint 1: io_uring infrastructure
- Add AsyncExpertReader class (cpu_backend/async_io.hpp/cpp)
- Wrap io_uring for async expert weight loading from SSD
- Support submit_read, wait_one_completion, poll_completions, wait_for_expert
- CMakeLists.txt: detect liburing and link if available

Sprint 2: AMX integration and cache stats
- common.hpp: add IOBackend enum, ExpertFileSlot, ExpertCacheStats
- amx/moe.hpp: add io_uring branch in allocate_and_copy_expert()
- amx/moe.hpp: add cache hit/miss/promote/demote stats in ensure_expert_ready()
- Conditional compilation with HAVE_LIBURING for Linux-only support

Key benefits:
- Eliminate OS page cache dependency (no double-occupancy)
- Predictable I/O latency with completion notifications
- Cache hit rate instrumentation for policy comparison

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add Python-side support for io_uring-based expert loading:

1. Python bindings (ext_bindings.cpp):
   - Expose AsyncExpertReader class to Python
   - Expose IOBackend enum (MMAP, IOURING)

2. Utility modules:
   - async_io_manager.py: Global AsyncExpertReader singleton
   - loader.py: load_experts_iouring() for file descriptor extraction

3. CLI integration (run.py):
   - Add --io-backend option (mmap/iouring)
   - Add --enable-cache-stats flag
   - Pass via KT_IO_BACKEND and KT_ENABLE_CACHE_STATS env vars

4. Configuration (experts_base.py):
   - Read io_backend and enable_cache_stats from environment
   - Store in wrapper config for C++ consumption

5. Unit tests (test_async_io.py):
   - Test AsyncExpertReader basic read
   - Test batch reading with multiple experts
   - Test timeout behavior
   - Test IOBackend enum

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
RaQiu and others added 20 commits May 6, 2026 22:10
…arameter

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Fix BLOCK_SIZE macro conflict with liburing by undefining after include
- Fix AsyncExpertReader Python binding syntax (add missing semicolon)
- Add completed_experts_ tracking to distinguish completed vs never-submitted experts
- Update wait_for_expert() to return false for never-submitted experts (fixes timeout test)
- Remove num_workers parameter from test calls (already removed from C++ constructor)
- Remove shutdown() method calls from tests (method doesn't exist in C++)
- Fix test import path to use kt_kernel_ext from build directory

All async_io unit tests now pass (4/4 passed)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

gemini-code-assist Bot (Contributor) left a comment

Code Review

This pull request introduces a new Electron-based frontend for KTransformers, providing a GUI for model management, server control, and system monitoring. It also includes significant backend enhancements, such as io_uring support for expert weight loading, NUMA-aware memory management, and a tiered weight residency strategy to optimize performance when model sizes exceed physical RAM. Several security and performance issues were identified in the review, including insecure Electron settings, command injection risks in IPC handlers, and inefficient resource management in the hot path of the inference engine.

Comment thread kt-frontend/electron/main.ts Outdated
preload: join(__dirname, 'preload.js'),
contextIsolation: true,
nodeIntegration: false,
webSecurity: false // allow cross-origin requests to remote API servers

security-high high

Disabling webSecurity is a significant security risk in Electron applications as it bypasses the Same-Origin Policy. This allows the renderer process to make requests to any domain, which can be exploited if the application ever loads untrusted content. It is recommended to keep webSecurity enabled and handle cross-origin requirements via proper CORS configuration on the server or by using a proxy in the main process.

Comment thread kt-frontend/electron/ipc/handlers.ts Outdated
if (model) args.push('--model', model)
args.push('--output', tmpFile)
return new Promise((resolve) => {
  const proc = spawn('kt', args, { shell: true })

security-high high

Using shell: true with spawn can lead to command injection vulnerabilities if any part of the arguments is user-controlled. Since args is already an array, shell: true is unnecessary and should be set to false (which is the default) to execute the command directly without a shell. This improvement should be applied to all spawn calls in this file and other service files where shell: true is used.

Suggested change
- const proc = spawn('kt', args, { shell: true })
+ const proc = spawn('kt', args, { shell: false })

Comment thread kt-kernel/operators/amx/bf16-moe.hpp Outdated
Comment on lines +739 to +740
auto accum_output_holder = alloc_aligned_f32(output_elems);
auto wave_output_holder = alloc_aligned_f32(output_elems);

high

Performing heap allocations (posix_memalign or _aligned_malloc) inside the forward method for every prefill pass is a significant performance bottleneck and can lead to memory fragmentation. These buffers should be pre-allocated during initialization or managed through a reusable memory pool to ensure optimal performance in the hot path.

Comment thread kt-kernel/operators/common.hpp Outdated
Comment on lines +171 to +211
void configure_experts(int layer, int n) {
  if (n <= 0) return;
  std::lock_guard<std::mutex> guard(expert_mu);
  if (layer_idx == layer && expert_num == n && expert_access_count != nullptr) {
    return;
  }
  layer_idx = layer;
  expert_num = n;
  dump_path.clear();
  if (const char* path = std::getenv("KT_EXPERT_STATS_PATH")) {
    dump_path = path;
  }
  dump_every = 1;
  if (const char* raw_every = std::getenv("KT_EXPERT_STATS_DUMP_EVERY")) {
    char* end = nullptr;
    const unsigned long long parsed = std::strtoull(raw_every, &end, 10);
    if (end != raw_every && parsed > 0) {
      dump_every = static_cast<uint64_t>(parsed);
    }
  }
  expert_access_count = std::make_unique<std::atomic<uint64_t>[]>(n);
  expert_hit_count = std::make_unique<std::atomic<uint64_t>[]>(n);
  expert_miss_count = std::make_unique<std::atomic<uint64_t>[]>(n);
  expert_cold_miss_count = std::make_unique<std::atomic<uint64_t>[]>(n);
  expert_in_flight_miss_count = std::make_unique<std::atomic<uint64_t>[]>(n);
  expert_promote_count = std::make_unique<std::atomic<uint64_t>[]>(n);
  expert_prefetch_hit_count = std::make_unique<std::atomic<uint64_t>[]>(n);
  for (int i = 0; i < n; ++i) {
    expert_access_count[i].store(0, std::memory_order_relaxed);
    expert_hit_count[i].store(0, std::memory_order_relaxed);
    expert_miss_count[i].store(0, std::memory_order_relaxed);
    expert_cold_miss_count[i].store(0, std::memory_order_relaxed);
    expert_in_flight_miss_count[i].store(0, std::memory_order_relaxed);
    expert_promote_count[i].store(0, std::memory_order_relaxed);
    expert_prefetch_hit_count[i].store(0, std::memory_order_relaxed);
  }
}

void note_expert_access(int expert_id) {
  if (expert_id < 0 || expert_id >= expert_num || expert_access_count == nullptr) return;
  expert_access_count[expert_id].fetch_add(1, std::memory_order_relaxed);

high

There is a potential race condition in ExpertCacheStats. The configure_experts method reallocates the expert_access_count array while holding expert_mu, but the note_expert_access method (and others like note_expert_hit) accesses this array without any synchronization. If configure_experts is called while inference threads are active, it could lead to use-after-free or out-of-bounds access. Consider using a synchronization mechanism that protects the array pointer itself during access, or ensure that configuration only happens when no inference is running.

Comment thread kt-frontend/electron/ipc/handlers.ts Outdated
return new Promise((resolve) => {
  let output = ''
  let errOutput = ''
  const proc = spawn('python3', [scriptPath], { shell: false })

medium

Hardcoding python3 may cause the application to fail on Windows systems where the executable is typically named python. Consider detecting the platform or using a configurable path for the Python interpreter.

Suggested change
- const proc = spawn('python3', [scriptPath], { shell: false })
+ const proc = spawn(process.platform === 'win32' ? 'python' : 'python3', [scriptPath], { shell: false })

Comment thread kt-kernel/operators/amx/bf16-moe.hpp Outdated
}

const bool enable_wave_mode = []() {
  const char* raw = std::getenv("KT_ENABLE_BF16_WAVE_RESIDENT");

medium

Reading environment variables using std::getenv inside the forward method is inefficient as it is called on every inference pass. These values should be read once during initialization and stored as member variables to avoid repeated string lookups in the hot path.

Comment thread kt-kernel/operators/llamafile/moe.hpp Outdated
static float act_fn(float x) { return x / (1.0f + expf(-x)); }

static inline bool use_row_dot_debug_fallback() {
  const char* env = std::getenv("KT_LLAMA_USE_ROW_DOT");

medium

Calling std::getenv in the hot path (forward_one) should be avoided. It is recommended to cache the value of environment variables during initialization.
