[Feature] Add MESH expert residency with io_uring direct I/O for MoE inference by RaQiu · Pull Request #2003 · kvcache-ai/ktransformers

RaQiu · 2026-05-13T12:17:37Z

What does this PR do?

This PR introduces MESH, an experimental memory-tiered expert residency system for KTransformers MoE inference.

MESH is designed for heterogeneous CPU-GPU MoE serving when the full expert working set cannot stay comfortably resident in host DRAM. The current KT path works well when expert weights are available from normal memory-backed storage, but constrained-memory deployments expose a difficult systems problem: only a small subset of experts is activated at runtime, yet the full expert set must remain accessible. Relying only on mmap leaves expert residency mostly under OS page-cache control.

MESH adds an explicit runtime-managed expert residency layer for AMXINT4 MoE inference. It can load expert weights from NVMe through io_uring + O_DIRECT into NUMA-local CPU buffers, manage a bounded resident expert slot pool, and preserve the existing KT AMX compute path once the required experts are resident.

The goal is not to replace KT's AMX kernels. The goal is to make expert residency explicit: which experts are in CPU memory, which experts are cold, when cold experts are read, and how prefill/decode should share the resident slot pool.

Why this is useful

On constrained-memory machines, mmap makes MoE expert residency hard to control:

- The OS page cache can hold a hidden copy of expert weights.
- NUMA-local CPU execution may require a second application-managed copy.
- Page-cache reclaim timing is not visible to the runtime.
- Page faults happen synchronously on the critical path.
- The runtime cannot directly express that some experts should remain hot while others can be evicted.

MESH gives KTransformers a runtime-level mechanism to manage this explicitly:

- expert weights can be read directly into NUMA-local buffers,
- resident experts are tracked by layer and expert id,
- cold expert reads are issued through async io_uring,
- promotion and demotion are visible to the runtime,
- GPU-resident experts remain outside CPU resident-slot management,
- existing AMX compute kernels are reused after residency is resolved.

This is useful for local single-node MoE serving, workstation deployments, and memory-constrained CPU-GPU systems where CPU AMX, GPU attention, NUMA locality, and NVMe-backed expert storage need to work together.

Main components

1. `io_uring` direct-I/O expert loading

MESH adds an io_uring-based loading path for expert weights. With O_DIRECT, expert tensors can be read from storage directly into application-owned buffers, avoiding the OS page cache on the MESH path.

The async reader tracks request completion explicitly, validates read results, and includes retry logic for failed or incomplete reads.
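O_DIRECT reads generally need the file offset, the length, and the buffer address aligned to the device's logical block size. The sketch below shows one way to widen a tensor's byte range to such a window. The names `DirectRange` and `align_for_direct_io` and the 4096-byte default are assumptions for illustration, not identifiers from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type: the block-aligned window that covers a tensor's
// byte range inside a weight file.
struct DirectRange {
  std::uint64_t file_offset;  // aligned start of the read
  std::size_t length;         // aligned number of bytes to read
  std::size_t skip;           // bytes to skip in the buffer to reach the tensor
};

// Round [offset, offset + size) outward to multiples of `block`.
// `block` must be a power of two; 4096 is an assumed, common value.
inline DirectRange align_for_direct_io(std::uint64_t offset, std::size_t size,
                                       std::size_t block = 4096) {
  assert(block != 0 && (block & (block - 1)) == 0);
  const std::uint64_t mask = static_cast<std::uint64_t>(block) - 1;
  const std::uint64_t start = offset & ~mask;
  const std::uint64_t end = (offset + size + mask) & ~mask;
  return DirectRange{start, static_cast<std::size_t>(end - start),
                     static_cast<std::size_t>(offset - start)};
}
```

A caller would read `length` bytes at `file_offset` into an aligned buffer and then use the tensor data starting `skip` bytes into that buffer.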

2. NUMA-local resident expert slots

MESH adds a resident slot pool for CPU-managed experts. Each resident expert is associated with slot-owned buffers for its expert tensors. The slot metadata tracks expert state, slot state, and active readers so that a resident expert is not evicted while it is being used by the AMX forward path.

The slot pool is virtual: a slot is not permanently tied to a fixed expert id. It can be rebound to different experts as promotion and eviction happen.
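A minimal single-threaded sketch of the idea, assuming hypothetical names (`Slot`, `SlotPool`, `bind`, `pin`): slots are rebound to different `(layer, expert)` pairs, and a slot with active readers is never chosen as a victim. The PR's real slot metadata and locking are not shown here and may differ.

```cpp
#include <cstddef>
#include <optional>
#include <unordered_map>
#include <vector>

// Hypothetical slot metadata; field names are illustrative only.
struct Slot {
  int layer = -1;
  int expert = -1;         // -1 means the slot is free
  int active_readers = 0;  // > 0 while a forward pass uses the buffers
};

class SlotPool {
 public:
  explicit SlotPool(std::size_t n) : slots_(n) {}

  // Returns the slot currently bound to (layer, expert), if any.
  std::optional<std::size_t> find(int layer, int expert) const {
    auto it = index_.find(key(layer, expert));
    if (it == index_.end()) return std::nullopt;
    return it->second;
  }

  // Binds (layer, expert) to a slot, preferring a free slot and otherwise
  // evicting an unpinned one. Returns nullopt when every slot is pinned.
  std::optional<std::size_t> bind(int layer, int expert) {
    if (auto hit = find(layer, expert)) return hit;
    std::optional<std::size_t> victim;
    for (std::size_t i = 0; i < slots_.size(); ++i) {
      if (slots_[i].active_readers > 0) continue;  // never evict a slot in use
      if (slots_[i].expert < 0) { victim = i; break; }
      if (!victim) victim = i;
    }
    if (!victim) return std::nullopt;
    Slot& s = slots_[*victim];
    if (s.expert >= 0) index_.erase(key(s.layer, s.expert));
    s.layer = layer;
    s.expert = expert;
    index_[key(layer, expert)] = *victim;
    return victim;
  }

  void pin(std::size_t i) { ++slots_[i].active_readers; }
  void unpin(std::size_t i) { --slots_[i].active_readers; }

 private:
  static long long key(int layer, int expert) {
    return (static_cast<long long>(layer) << 32) | static_cast<unsigned>(expert);
  }
  std::vector<Slot> slots_;
  std::unordered_map<long long, std::size_t> index_;
};
```

The victim choice here is deliberately naive (first unpinned slot); in the PR that decision belongs to the cache policy described in section 4.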

3. Batched cold expert promotion

For a layer forward, MESH can collect the CPU experts needed by that forward pass, identify which ones are already resident, submit reads for the cold ones, wait for those reads to complete, bind the completed buffers into slots, and then call the existing KT AMX Base::forward() path.

This keeps the compute path close to KT's original implementation while moving expert residency decisions into an explicit runtime layer.
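The first step of that sequence, splitting the routed experts into resident and cold sets, can be sketched as follows. `PromotionPlan` and `plan_promotion` are hypothetical names, and the resident set is modeled as a plain `std::unordered_set` rather than the PR's slot metadata.

```cpp
#include <unordered_set>
#include <vector>

struct PromotionPlan {
  std::vector<int> resident;  // already in a slot, usable immediately
  std::vector<int> cold;      // must be read before compute can start
};

// Splits the experts routed to the CPU for one layer forward into resident
// and cold sets, dropping duplicate ids from the router output.
inline PromotionPlan plan_promotion(const std::vector<int>& routed,
                                    const std::unordered_set<int>& resident_set) {
  PromotionPlan plan;
  std::unordered_set<int> seen;
  for (int e : routed) {
    if (!seen.insert(e).second) continue;  // several tokens chose this expert
    (resident_set.count(e) ? plan.resident : plan.cold).push_back(e);
  }
  return plan;
}
```

Only the `cold` list would need reads submitted; once those complete and are bound to slots, compute proceeds as it would for an all-resident layer.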

4. Cache policy and heat-aware residency

MESH keeps a bounded resident set and uses a policy-driven eviction path. The current implementation supports a SIEVE-style base policy and a heat/lookahead signal derived from router scores. The intent is to retain experts that are likely to be reused while allowing cold or low-value experts to be demoted.

The implementation also supports full-gate score observation and skips unnecessary observation when the effective resident capacity already covers all CPU-managed experts.
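For readers unfamiliar with SIEVE, here is a compact sketch of the plain published algorithm: a FIFO list, one visited bit per entry, and a hand that sweeps from oldest to newest. It does not model the heat/lookahead signal or slot pinning described above, and `SieveCache` is a hypothetical class, not the PR's policy code.

```cpp
#include <cstddef>
#include <iterator>
#include <list>
#include <unordered_map>

class SieveCache {
 public:
  explicit SieveCache(std::size_t capacity) : capacity_(capacity) {}

  // Records an access. Returns true on a hit; on a miss the key is inserted,
  // evicting one entry first if the cache is full.
  bool access(int key) {
    auto it = index_.find(key);
    if (it != index_.end()) {
      it->second->visited = true;  // hits only set a bit, no reordering
      return true;
    }
    if (capacity_ == 0) return false;
    if (index_.size() >= capacity_) evict();
    order_.push_front(Node{key, false});  // newest entries live at the front
    index_[key] = order_.begin();
    return false;
  }

  bool contains(int key) const { return index_.count(key) != 0; }

 private:
  struct Node { int key; bool visited; };

  void evict() {
    // The hand sweeps from the oldest entry (back) toward the newest (front).
    auto victim = (hand_ == order_.end()) ? std::prev(order_.end()) : hand_;
    while (victim->visited) {
      victim->visited = false;  // visited entries get a second chance
      victim = (victim == order_.begin()) ? std::prev(order_.end())
                                          : std::prev(victim);
    }
    hand_ = (victim == order_.begin()) ? order_.end() : std::prev(victim);
    index_.erase(victim->key);
    order_.erase(victim);
  }

  std::size_t capacity_;
  std::list<Node> order_;
  std::list<Node>::iterator hand_ = order_.end();
  std::unordered_map<int, std::list<Node>::iterator> index_;
};
```

With capacity 3 and the access sequence 1, 2, 3, 1, 4, the re-accessed key 1 survives and key 2 is evicted, which is where SIEVE differs from plain FIFO.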

5. Deferred-expert aware decode behavior

MESH integrates with KT's deferred expert execution. Decode can distinguish resident experts from cold experts and issue prefetches for cold deferred experts. The accounting separates normal hits, cold misses, and in-flight misses so that async prefetch behavior is not mistaken for a fully cold miss.
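The three-way accounting can be illustrated with a small classifier. The enum and struct below are hypothetical; the PR's own counters appear in the `ExpertCacheStats` excerpt quoted in the review further down.

```cpp
#include <cstdint>

// Hypothetical per-expert residency states seen by the decode path.
enum class ExpertState { Resident, InFlight, Cold };

struct DecodeStats {
  std::uint64_t hits = 0;
  std::uint64_t cold_misses = 0;
  std::uint64_t in_flight_misses = 0;
};

// Classifies one lookup. Returns true only when a new read must be submitted;
// an in-flight expert already has a prefetch pending, so it is counted
// separately and does not trigger a second read.
inline bool record_lookup(ExpertState state, DecodeStats& stats) {
  switch (state) {
    case ExpertState::Resident: ++stats.hits; return false;
    case ExpertState::InFlight: ++stats.in_flight_misses; return false;
    case ExpertState::Cold: ++stats.cold_misses; return true;
  }
  return false;
}
```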

6. Prefill residency support

MESH includes prefill-specific residency support. The current branch contains a prefill layer-window mode and transition logic back into decode hot-cache mode. This allows prefill and decode to interpret the resident slot pool differently while still sharing the same underlying slot abstraction.

The implementation also supports configurable early-layer residency through KT_MESH_EARLY_LAYER_EXPERTS, because early MoE layers can have different miss behavior from deeper layers.

7. GPU expert compatibility

MESH does not treat GPU experts as CPU resident-slot candidates. CPU-side residency logic checks the actual per-layer GPU expert mask and skips experts that are already assigned to GPU execution.

This matters because GPU experts are not necessarily a fixed prefix of expert ids and may vary by layer or under dynamic expert placement.
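A sketch of why a per-layer mask is needed rather than a threshold on the expert id. `cpu_candidates` and the `std::vector<bool>` mask representation are assumptions for illustration.

```cpp
#include <cstddef>
#include <vector>

// Returns the routed experts that the CPU side must handle for one layer.
// `gpu_mask[e]` is true when expert `e` runs on the GPU in this layer.
inline std::vector<int> cpu_candidates(const std::vector<int>& routed,
                                       const std::vector<bool>& gpu_mask) {
  std::vector<int> out;
  for (int e : routed) {
    if (e < 0 || static_cast<std::size_t>(e) >= gpu_mask.size()) continue;
    if (!gpu_mask[static_cast<std::size_t>(e)]) out.push_back(e);
  }
  return out;
}
```

With GPU experts at ids 1 and 3, a check such as `e >= num_gpu_experts` would wrongly keep expert 3 on the CPU path and drop expert 0, while the mask lookup handles both correctly.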

Compatibility

MESH is intended to be opt-in through the IOURING backend. The existing mmap/default path should continue to behave as before.

The implementation is designed to preserve KT's existing execution assumptions:

- AMX compute kernels are not replaced.
- The quantized expert layout is preserved.
- Existing CPU/GPU expert separation is respected.
- Existing KT forward computation is reused after expert residency is resolved.
- Non-IOURING paths should not depend on MESH residency state.

Current validation status

This work has been validated at two levels.

Automated/unit-level tests

The branch includes unit-level coverage for the async I/O layer, including:

- basic async reads,
- batch reads,
- multiple requests belonging to the same expert,
- timeout behavior,
- IO backend enum exposure,
- Python-to-C++ config conversion,
- short-read rejection.

The short-read test verifies that a completed-but-incomplete read is not treated as a successful request.
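The check being tested can be sketched as below. An io_uring completion reports a negative value for an error (`-errno`) and otherwise the number of bytes transferred, so a non-negative result smaller than the request is a short read. `ReadStatus` and `check_completion` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

enum class ReadStatus { Ok, ShortRead, Error };

// `res` follows the io_uring completion convention: a negative value is
// -errno, a non-negative value is the number of bytes transferred.
inline ReadStatus check_completion(std::int32_t res, std::size_t expected) {
  if (res < 0) return ReadStatus::Error;
  if (static_cast<std::size_t>(res) < expected) return ReadStatus::ShortRead;
  return ReadStatus::Ok;
}
```

Treating `ShortRead` as a failure, or as a trigger to resubmit the remainder, keeps partially filled buffers from being bound to a resident slot.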

Manual/system-level validation

MESH has also been tested manually on a dual-socket AMX server with Qwen3.5-35B AMXINT4 weights, NVMe-backed expert storage, NUMA execution, and SGLang/KTransformers serving.

The manual validation included:

- IOURING backend startup and model serving.
- io_uring direct-read behavior under AMXINT4 expert loading.
- Deferred-expert decode behavior.
- Hit / cold-miss / in-flight-miss accounting.
- Per-layer and per-token expert hit-rate analysis.
- Prefill behavior under multiple scheduling variants.
- Early-layer full-residency configuration.
- GPU-expert mask compatibility with CPU resident-slot management.
- Prefill timing and storage pressure analysis with iostat.
- Server-side smoke tests for current MESH code paths.

Some of these experiments are saved under local paper/experiment artifact directories and are not suitable as normal CI tests.

The full MESH prefill/decode system test is not currently included as an automated per-commit test because it is hardware- and model-dependent: it requires AMX-capable CPUs, the AMXINT4 expert weight layout, NVMe-backed weights, NUMA configuration, and a running SGLang/KTransformers server.

This PR should be reviewed as an experimental systems path rather than a fully production-hardened default backend.

Before submitting

Did you read the contributor guideline?
Did you write any new necessary tests?

Note: automated tests exist for the async I/O layer. MESH has also been manually validated on the target AMX/NVMe/model-weight infrastructure, but those full system tests are not included in per-commit CI.

- Add platform-conditional triton dependencies (PEP 508 markers): triton on Linux/macOS, triton-windows on Windows
- Update kt-kernel/pyproject.toml, kt-kernel/requirements.txt, kt-sft/pyproject.toml, and kt-sft/setup.py
- Add Windows OS classifier to kt-kernel and kt-sft
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

On Windows, pkg-config and hwloc are hard to get. CMake now auto-downloads the official pre-built hwloc binaries and creates an IMPORTED target, so users don't need to install anything manually. Linux behavior unchanged (still uses pkg-config). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- task_queue.cpp: guard unused pthread.h/sched.h includes
- worker_pool.h: guard numa.h, use hwloc fallback for set_to_numa on Win
- worker_pool.cpp: add Windows compat shims (sched_getcpu, numa_node_of_cpu, numa_num_configured_nodes), guard pthread_setname_np
- shared_mem_buffer.cpp: use _aligned_malloc/_aligned_free on Windows, add Windows compat shims for NUMA functions
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add NOMINMAX compile definition on Windows to prevent windows.h min/max macros from conflicting with std::min/std::max (fixes C2589)
- Wrap numa.h/numaif.h includes in moe.hpp with #ifndef _WIN32
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…etup.py
- Set per-config CMAKE_LIBRARY_OUTPUT_DIRECTORY_RELEASE etc. for MSVC multi-config generators that put output in Release/ subdirectory
- Replace hardcoded .so with platform-aware suffix (.pyd on Windows)
- Fix multi-variant rename and rglob to handle both .so and .pyd
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…nagement
- Dashboard connects to configured SGLang server showing model info, VRAM usage breakdown, throughput metrics, and server configuration
- Chat page with conversation history persistence, connection bar for quick URL/model switching, and sidebar for managing multiple chats
- Server management with start/stop controls, log viewer, and diagnostics
- Model management with scanning, verification, and HuggingFace download
- Configuration page with server, paths, inference, and advanced settings
- Fix config save: unwrap Vue reactive proxies before Electron IPC calls
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…romotion
Implement Plan B tiered weight caching for MoE experts:
- Tier 0: NUMA-local malloc buffers (~80ns, hot experts promoted via background thread)
- Tier 1: mmap pages resident in RAM (~150ns, OS-managed page cache)
- Tier 2: mmap pages on disk (~100us, cold experts evicted under memory pressure)
C++ (llamafile/moe.hpp):
- Replace flat weight buffers with per-expert pointer vectors (gate/up/down_expert_ptrs_)
- Add baseline pointer arrays for demotion fallback
- load_weights() supports both mmap zero-copy and legacy memcpy modes via use_mmap flag
- promote_expert(): numa_alloc_onnode + memcpy + atomic pointer swap
- demote_expert(): restore baseline pointers + numa_free
- TP_MOE<LLAMA_MOE_TP> forwards promote/demote to all TP instances
- Proper destructor frees NUMA Tier 0 buffers and storage
C++ (amx/moe.hpp):
- Add mmap zero-copy path in load_weights (std::free original buffers, point to mmap)
C++ (ext_bindings.cpp):
- Expose promote_expert/demote_expert/is_expert_promoted via pybind11 (LLAMA_MOE_TP only)
- Expose use_mmap field on GeneralMOEConfig
Python (weight_provider.py - new):
- TieredWeightProvider: orchestrates promote/demote via C++ MOE objects
- ExpertHotnessTracker: EMA-based activation frequency with vectorized np.add.at()
- MmapWeightRegion: zero-copy mmap views with madvise(MADV_WILLNEED) prefetch
- Background promotion thread with lazy start (only when first MOE registered)
Python (experts_base.py):
- Add _tiered_provider singleton and _prev_topk_ids class variables
- submit_forward(): prefetch using previous token's expert IDs (no GPU sync)
- sync_forward(): record activations after CPU sync (not in hot path)
- _register_moe_with_provider(): connect subclass MOE objects to tiered provider
Python (llamafile.py):
- load_weights() dual-mode: tiered (mmap zero-copy) vs legacy (memcpy)
- Register per-expert mmap regions with TieredWeightProvider for prefetch
- Register MOE with provider to enable promote/demote
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- CLI: add --weight-strategy (auto/tiered/legacy) to `kt run` command
- CLI: pass --kt-weight-strategy to sglang engine launch
- CLI: show weight strategy in interactive preview and tuna engine
- Frontend: add Memory-Mapped Weights toggle in server config dialog
- Frontend: pass --weight-strategy flag to kt process spawn
- Add autodl setup script for DeepSeek V3.2
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Three fixes for NUMA locality of mmap'd weight pages:
1. mbind in load_weights(): After setting per-expert mmap pointers, call mbind(MPOL_BIND | MPOL_MF_MOVE) to bind/migrate pages to the TP instance's NUMA node. Without this, pages land on whichever node first faults them (typically Python's main thread on node 0).
2. madvise in promote_expert(): Issue MADV_WILLNEED on baseline mmap regions before memcpy to trigger async readahead. Reduces synchronous page fault stalls (~100us each) during promotion.
3. NUMA-aware promote dispatch: TP_MOE::promote_expert() now spawns per-NUMA threads (via set_to_numa) so memcpy executes on the correct NUMA node. Previously ran on Python's unbound promotion thread, causing cross-NUMA traffic for both source reads and destination writes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add KT_TIER0_MEMORY_GB and KT_MAX_TIER0_EXPERTS env vars
- Fallback mechanism: CLI params not passed to SGLang directly
- BaseMoEWrapper reads env vars when parameters are None
- Avoids modifying third-party SGLang code
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Remove --kt-weight-strategy, --kt-tier0-memory-gb, --kt-max-tier0-experts from SGLang command
- SGLang doesn't recognize these parameters (would cause startup failure)
- All three parameters now passed via environment variables only
- Add KT_WEIGHT_STRATEGY env var support in BaseMoEWrapper
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Use psutil to detect available system memory
- Default: all available memory minus 4GB safety margin
- Optimized for dedicated inference machines
- User can override with --tier0-memory-gb or --max-tier0-experts
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…X/moe_kernel backends
- Add promote_expert/demote_expert/is_expert_promoted hooks to AMX TP and moe_kernel C++ classes
- Fix BF16 cache_capacity_=0 when max_tier0_experts<=0 (was incorrectly defaulting to 1)
- Wire max_tier0_experts into AMX moe_config
- Add mmap region registration for AMX and moe_kernel backends
- Add cgroup v1/v2-aware memory detection in weight_provider.py
- Add AMXINT4/AMXINT8/MOE_INT4/MOE_INT8 to _provider_backends in experts_base.py
- Fix pre-commit hook shebang and mapfile for macOS bash 3.2
- All 11 unit tests passing

Sprint 1: io_uring infrastructure
- Add AsyncExpertReader class (cpu_backend/async_io.hpp/cpp)
- Wrap io_uring for async expert weight loading from SSD
- Support submit_read, wait_one_completion, poll_completions, wait_for_expert
- CMakeLists.txt: detect liburing and link if available
Sprint 2: AMX integration and cache stats
- common.hpp: add IOBackend enum, ExpertFileSlot, ExpertCacheStats
- amx/moe.hpp: add io_uring branch in allocate_and_copy_expert()
- amx/moe.hpp: add cache hit/miss/promote/demote stats in ensure_expert_ready()
- Conditional compilation with HAVE_LIBURING for Linux-only support
Key benefits:
- Eliminate OS page cache dependency (no double-occupancy)
- Predictable I/O latency with completion notifications
- Cache hit rate instrumentation for policy comparison
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Add Python-side support for io_uring-based expert loading:
1. Python bindings (ext_bindings.cpp):
   - Expose AsyncExpertReader class to Python
   - Expose IOBackend enum (MMAP, IOURING)
2. Utility modules:
   - async_io_manager.py: Global AsyncExpertReader singleton
   - loader.py: load_experts_iouring() for file descriptor extraction
3. CLI integration (run.py):
   - Add --io-backend option (mmap/iouring)
   - Add --enable-cache-stats flag
   - Pass via KT_IO_BACKEND and KT_ENABLE_CACHE_STATS env vars
4. Configuration (experts_base.py):
   - Read io_backend and enable_cache_stats from environment
   - Store in wrapper config for C++ consumption
5. Unit tests (test_async_io.py):
   - Test AsyncExpertReader basic read
   - Test batch reading with multiple experts
   - Test timeout behavior
   - Test IOBackend enum
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Fix BLOCK_SIZE macro conflict with liburing by undefining after include
- Fix AsyncExpertReader Python binding syntax (add missing semicolon)
- Add completed_experts_ tracking to distinguish completed vs never-submitted experts
- Update wait_for_expert() to return false for never-submitted experts (fixes timeout test)
- Remove num_workers parameter from test calls (already removed from C++ constructor)
- Remove shutdown() method calls from tests (method doesn't exist in C++)
- Fix test import path to use kt_kernel_ext from build directory
All async_io unit tests now pass (4/4 passed)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request introduces a new Electron-based frontend for KTransformers, providing a GUI for model management, server control, and system monitoring. It also includes significant backend enhancements, such as io_uring support for expert weight loading, NUMA-aware memory management, and a tiered weight residency strategy to optimize performance when model sizes exceed physical RAM. Several security and performance issues were identified in the review, including insecure Electron settings, command injection risks in IPC handlers, and inefficient resource management in the hot path of the inference engine.

gemini-code-assist · 2026-05-13T12:22:20Z

+      preload: join(__dirname, 'preload.js'),
+      contextIsolation: true,
+      nodeIntegration: false,
+      webSecurity: false  // allow cross-origin requests to remote API servers


Disabling webSecurity is a significant security risk in Electron applications as it bypasses the Same-Origin Policy. This allows the renderer process to make requests to any domain, which can be exploited if the application ever loads untrusted content. It is recommended to keep webSecurity enabled and handle cross-origin requirements via proper CORS configuration on the server or by using a proxy in the main process.

gemini-code-assist · 2026-05-13T12:22:20Z

+    if (model) args.push('--model', model)
+    args.push('--output', tmpFile)
+    return new Promise((resolve) => {
+      const proc = spawn('kt', args, { shell: true })


Using shell: true with spawn can lead to command injection vulnerabilities if any part of the arguments is user-controlled. Since args is already an array, shell: true is unnecessary and should be set to false (which is the default) to execute the command directly without a shell. This improvement should be applied to all spawn calls in this file and other service files where shell: true is used.

Suggested change

-      const proc = spawn('kt', args, { shell: true })
+      const proc = spawn('kt', args, { shell: false })

gemini-code-assist · 2026-05-13T12:22:20Z

+        auto accum_output_holder = alloc_aligned_f32(output_elems);
+        auto wave_output_holder = alloc_aligned_f32(output_elems);


Performing heap allocations (posix_memalign or _aligned_malloc) inside the forward method for every prefill pass is a significant performance bottleneck and can lead to memory fragmentation. These buffers should be pre-allocated during initialization or managed through a reusable memory pool to ensure optimal performance in the hot path.
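One way to act on this suggestion is a grow-only scratch buffer owned by the MoE object, so that steady-state forward calls reuse the same memory. `ScratchF32` is a hypothetical class, and the 64-byte alignment is an assumption; the PR's `alloc_aligned_f32` may use a different alignment or allocator.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>

// Hypothetical reusable scratch buffer: it grows when a larger request
// arrives and is otherwise reused without touching the allocator.
class ScratchF32 {
 public:
  // Returns a 64-byte-aligned buffer holding at least `elems` floats.
  // Contents are unspecified; growing discards the previous contents.
  float* get(std::size_t elems) {
    if (elems > capacity_) {
      const std::size_t bytes = ((elems * sizeof(float) + 63) / 64) * 64;
      void* p = std::aligned_alloc(64, bytes);  // C++17, not available on MSVC
      if (p == nullptr) throw std::bad_alloc();
      buf_.reset(static_cast<float*>(p));
      capacity_ = bytes / sizeof(float);
    }
    return buf_.get();
  }

  std::size_t capacity() const { return capacity_; }

 private:
  struct Free {
    void operator()(float* p) const { std::free(p); }
  };
  std::unique_ptr<float, Free> buf_;
  std::size_t capacity_ = 0;
};
```

Each thread or forward context would need its own instance, since the class does no locking.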

gemini-code-assist · 2026-05-13T12:22:20Z

+  void configure_experts(int layer, int n) {
+    if (n <= 0) return;
+    std::lock_guard<std::mutex> guard(expert_mu);
+    if (layer_idx == layer && expert_num == n && expert_access_count != nullptr) {
+      return;
+    }
+    layer_idx = layer;
+    expert_num = n;
+    dump_path.clear();
+    if (const char* path = std::getenv("KT_EXPERT_STATS_PATH")) {
+      dump_path = path;
+    }
+    dump_every = 1;
+    if (const char* raw_every = std::getenv("KT_EXPERT_STATS_DUMP_EVERY")) {
+      char* end = nullptr;
+      const unsigned long long parsed = std::strtoull(raw_every, &end, 10);
+      if (end != raw_every && parsed > 0) {
+        dump_every = static_cast<uint64_t>(parsed);
+      }
+    }
+    expert_access_count = std::make_unique<std::atomic<uint64_t>[]>(n);
+    expert_hit_count = std::make_unique<std::atomic<uint64_t>[]>(n);
+    expert_miss_count = std::make_unique<std::atomic<uint64_t>[]>(n);
+    expert_cold_miss_count = std::make_unique<std::atomic<uint64_t>[]>(n);
+    expert_in_flight_miss_count = std::make_unique<std::atomic<uint64_t>[]>(n);
+    expert_promote_count = std::make_unique<std::atomic<uint64_t>[]>(n);
+    expert_prefetch_hit_count = std::make_unique<std::atomic<uint64_t>[]>(n);
+    for (int i = 0; i < n; ++i) {
+      expert_access_count[i].store(0, std::memory_order_relaxed);
+      expert_hit_count[i].store(0, std::memory_order_relaxed);
+      expert_miss_count[i].store(0, std::memory_order_relaxed);
+      expert_cold_miss_count[i].store(0, std::memory_order_relaxed);
+      expert_in_flight_miss_count[i].store(0, std::memory_order_relaxed);
+      expert_promote_count[i].store(0, std::memory_order_relaxed);
+      expert_prefetch_hit_count[i].store(0, std::memory_order_relaxed);
+    }
+  }
+
+  void note_expert_access(int expert_id) {
+    if (expert_id < 0 || expert_id >= expert_num || expert_access_count == nullptr) return;
+    expert_access_count[expert_id].fetch_add(1, std::memory_order_relaxed);


There is a potential race condition in ExpertCacheStats. The configure_experts method reallocates the expert_access_count array while holding expert_mu, but the note_expert_access method (and others like note_expert_hit) accesses this array without any synchronization. If configure_experts is called while inference threads are active, it could lead to use-after-free or out-of-bounds access. Consider using a synchronization mechanism that protects the array pointer itself during access, or ensure that configuration only happens when no inference is running.
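A minimal sketch of the first option, assuming a reader/writer lock is acceptable on this path: writers swap the array under an exclusive lock and readers hold a shared lock while touching it. `AccessCounters` is a hypothetical reduction of `ExpertCacheStats` to a single counter array.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <utility>

class AccessCounters {
 public:
  // Writers build the new array first, then swap it in under an exclusive lock.
  void configure(int n) {
    if (n <= 0) return;
    auto fresh = std::make_unique<std::atomic<std::uint64_t>[]>(
        static_cast<std::size_t>(n));
    for (int i = 0; i < n; ++i) fresh[i].store(0, std::memory_order_relaxed);
    std::unique_lock<std::shared_mutex> lock(mu_);
    counts_ = std::move(fresh);
    num_ = n;
  }

  // Readers share the lock, so the array cannot be freed underneath them.
  void note_access(int id) {
    std::shared_lock<std::shared_mutex> lock(mu_);
    if (id < 0 || id >= num_ || !counts_) return;
    counts_[id].fetch_add(1, std::memory_order_relaxed);
  }

  std::uint64_t get(int id) const {
    std::shared_lock<std::shared_mutex> lock(mu_);
    if (id < 0 || id >= num_ || !counts_) return 0;
    return counts_[id].load(std::memory_order_relaxed);
  }

 private:
  mutable std::shared_mutex mu_;
  std::unique_ptr<std::atomic<std::uint64_t>[]> counts_;
  int num_ = 0;
};
```

The shared lock adds overhead to every counter update. If that cost matters, the reviewer's second option (configure once, before any inference thread starts) avoids it entirely.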

gemini-code-assist · 2026-05-13T12:22:20Z

+    return new Promise((resolve) => {
+      let output = ''
+      let errOutput = ''
+      const proc = spawn('python3', [scriptPath], { shell: false })


Hardcoding python3 may cause the application to fail on Windows systems where the executable is typically named python. Consider detecting the platform or using a configurable path for the Python interpreter.

Suggested change

-      const proc = spawn('python3', [scriptPath], { shell: false })
+      const proc = spawn(process.platform === 'win32' ? 'python' : 'python3', [scriptPath], { shell: false })

gemini-code-assist · 2026-05-13T12:22:20Z

+      }
+
+      const bool enable_wave_mode = []() {
+        const char* raw = std::getenv("KT_ENABLE_BF16_WAVE_RESIDENT");


Reading environment variables using std::getenv inside the forward method is inefficient as it is called on every inference pass. These values should be read once during initialization and stored as member variables to avoid repeated string lookups in the hot path.
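A sketch of the suggested pattern: read the variable once into a config object at initialization and consult the stored flag in the forward path. `WaveConfig` and `from_env` are hypothetical, and because the quoted hunk is truncated before the parsing logic, the rule used here (any value other than empty or "0" enables the mode) is an assumption.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical config object built once at initialization time.
struct WaveConfig {
  bool enable_wave_mode = false;

  static WaveConfig from_env() {
    WaveConfig cfg;
    const char* raw = std::getenv("KT_ENABLE_BF16_WAVE_RESIDENT");
    // Assumed parsing rule: any value other than empty or "0" enables it.
    cfg.enable_wave_mode =
        raw != nullptr && raw[0] != '\0' && std::strcmp(raw, "0") != 0;
    return cfg;
  }
};
```

The owning object would store the result of `WaveConfig::from_env()` as a member in its constructor, so `forward` only reads a `bool`.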

gemini-code-assist · 2026-05-13T12:22:20Z

  static float act_fn(float x) { return x / (1.0f + expf(-x)); }

+  static inline bool use_row_dot_debug_fallback() {
+    const char* env = std::getenv("KT_LLAMA_USE_ROW_DOT");


Calling std::getenv in the hot path (forward_one) should be avoided. It is recommended to cache the value of environment variables during initialization.

pigeonsoup and others added 30 commits February 12, 2026 15:15

support qwen3.5

240237b

merge main

1eafb76

fix: downgrade git hooks install failure from FATAL_ERROR to WARNING

4572888

On Windows, `sh` is not available, causing the git hooks install script to fail. Change FATAL_ERROR to WARNING so the build continues. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: replace shell-based git hooks install with pure CMake

5351e51

Use file(COPY) and file(GLOB) instead of calling `sh` to install git hooks, making it work on both Windows and Unix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Revert "fix: replace shell-based git hooks install with pure CMake"

af47320

This reverts commit 7511365.

fix: add Windows bat script for git hooks installation

57f7ba0

- Add install-git-hooks.bat as Windows equivalent of .sh version
- CMakeLists.txt: use bat on WIN32, sh on Unix
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: skip libnuma on Windows, use built-in NUMA API

ee49e8d

Windows provides NUMA support via kernel32.dll, so libnuma (Linux-only) is not needed. Skip the find_library check on WIN32. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[fix] Windows: support .pyd extension in CPU variant loader

89c98a2

The extension loader was hardcoded to look for .so files (Linux), but on Windows pybind11 produces .pyd files. Now tries both suffixes on Windows. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix(windows): support .pyd extension in doctor.py diagnostics

8c50160

The doctor command had hardcoded .so globs for finding kernel extensions. On Windows, Python extensions use .pyd suffix instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[fix]: C++ compilation and sglang backward compatibility

e5a8ad7

- Fix C++ compilation error in moe.hpp (typename decltype -> decltype)
- Fix sync_forward signature for sglang integration compatibility

feat: add tier0 memory management parameter support

b1b16c2

- Add tier0-memory-gb and max-tier0-experts CLI parameters
- Integrate tier0 configuration across MoE kernel modules

kt-kernel: enable bf16 tiered mmap lazy packing

46c76a3

[fix]: improve native conversion and weight runtime

e4e0ca5

[feat]: merge upstream kvcache-ai/ktransformers main

4b7f701

- Merged upstream changes including GPTQ expert loading support
- Resolved conflict in loader.py: kept both our mmap methods and upstream GPTQSafeTensorLoader.load_experts()

[merge]: upstream/qwen3.5 - Qwen3.5 MoE packed format support

69ae48b

RaQiu and others added 20 commits May 6, 2026 22:10

[fix]: AsyncExpertReader constructor signature - remove num_workers parameter

3740081

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

[feat]: wire AMX io_uring file-slot backend

5b758e8

[fix]: accept NUMA nodes in KTMoE wrapper

896de01

[feat]: add residency policy selection for tiered weights

cba7091

[fix]: detect Qwen3.5 unfused BF16 experts

4892595

[fix]: skip tiered provider for moe without promotion hooks

1c96c53

[fix]: fallback when cpuinfer stream hooks are unavailable

1bdfaac

[feat]: complete AMX io_uring resident backend

b6d0d29

[feat]: add safetensors direct I/O aligner

4c3d3cf

[fix]: support Qwen3.5 expert key conversion

00a2434

[fix]: align safetensors header without strict payload check

cce6b09

[fix]: tolerate safetensors handles without close

dfded88

[fix]: harden iouring promotion concurrency

ddea72e

[feat](mesh): implement slot residency and iouring prefetch

39894eb

[fix](mesh): skip full-gate observe during cuda graph capture

3d95b6c

[chore]: merge upstream main

6ff1396

[feat](mesh): restore deferred miss instrumentation

11ffbeb

[fix](mesh): guard resident io by buffer layout

61c8c56

[feat](mesh): add prefill layer-window residency mode

b3a21c1

gemini-code-assist Bot reviewed May 13, 2026

RaQiu added 7 commits May 14, 2026 10:10

[feat](mesh): restore prefill expert-window path

e23b632

[refactor](mesh): extract mesh runtime helpers

7c9b768

[fix](mesh): clean extracted include endings

778e7af

[chore](mesh): move windows and desktop work off main

15ddab0

[refactor](mesh): remove mmap residency leftovers

d7f7e66

fix(mesh): skip score defer during prefill

fb1a829

fix(mesh): chunk prefill scratch promotions

08bdfc2

RaQiu wants to merge 57 commits into kvcache-ai:main from RaQiu:main
