[Feature] Add MESH expert residency with io_uring direct I/O for MoE inference#2003
RaQiu wants to merge 57 commits into
- Add platform-conditional triton dependencies (PEP 508 markers): triton on Linux/macOS, triton-windows on Windows - Update kt-kernel/pyproject.toml, kt-kernel/requirements.txt, kt-sft/pyproject.toml, and kt-sft/setup.py - Add Windows OS classifier to kt-kernel and kt-sft Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
On Windows, `sh` is not available, causing the git hooks install script to fail. Change FATAL_ERROR to WARNING so the build continues. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use file(COPY) and file(GLOB) instead of calling `sh` to install git hooks, making it work on both Windows and Unix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This reverts commit 7511365.
- Add install-git-hooks.bat as Windows equivalent of .sh version - CMakeLists.txt: use bat on WIN32, sh on Unix Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
On Windows, pkg-config and hwloc are hard to get. CMake now auto-downloads the official pre-built hwloc binaries and creates an IMPORTED target, so users don't need to install anything manually. Linux behavior unchanged (still uses pkg-config). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Windows provides NUMA support via kernel32.dll, so libnuma (Linux-only) is not needed. Skip the find_library check on WIN32. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- task_queue.cpp: guard unused pthread.h/sched.h includes - worker_pool.h: guard numa.h, use hwloc fallback for set_to_numa on Win - worker_pool.cpp: add Windows compat shims (sched_getcpu, numa_node_of_cpu, numa_num_configured_nodes), guard pthread_setname_np - shared_mem_buffer.cpp: use _aligned_malloc/_aligned_free on Windows, add Windows compat shims for NUMA functions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add NOMINMAX compile definition on Windows to prevent windows.h min/max macros from conflicting with std::min/std::max (fixes C2589) - Wrap numa.h/numaif.h includes in moe.hpp with #ifndef _WIN32 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The extension loader was hardcoded to look for .so files (Linux), but on Windows pybind11 produces .pyd files. Now tries both suffixes on Windows. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…etup.py - Set per-config CMAKE_LIBRARY_OUTPUT_DIRECTORY_RELEASE etc. for MSVC multi-config generators that put output in Release/ subdirectory - Replace hardcoded .so with platform-aware suffix (.pyd on Windows) - Fix multi-variant rename and rglob to handle both .so and .pyd Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The doctor command had hardcoded .so globs for finding kernel extensions. On Windows, Python extensions use .pyd suffix instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nagement - Dashboard connects to configured SGLang server showing model info, VRAM usage breakdown, throughput metrics, and server configuration - Chat page with conversation history persistence, connection bar for quick URL/model switching, and sidebar for managing multiple chats - Server management with start/stop controls, log viewer, and diagnostics - Model management with scanning, verification, and HuggingFace download - Configuration page with server, paths, inference, and advanced settings - Fix config save: unwrap Vue reactive proxies before Electron IPC calls Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…romotion
Implement Plan B tiered weight caching for MoE experts:
- Tier 0: NUMA-local malloc buffers (~80ns, hot experts promoted via background thread)
- Tier 1: mmap pages resident in RAM (~150ns, OS-managed page cache)
- Tier 2: mmap pages on disk (~100us, cold experts evicted under memory pressure)
C++ (llamafile/moe.hpp):
- Replace flat weight buffers with per-expert pointer vectors (gate/up/down_expert_ptrs_)
- Add baseline pointer arrays for demotion fallback
- load_weights() supports both mmap zero-copy and legacy memcpy modes via use_mmap flag
- promote_expert(): numa_alloc_onnode + memcpy + atomic pointer swap
- demote_expert(): restore baseline pointers + numa_free
- TP_MOE<LLAMA_MOE_TP> forwards promote/demote to all TP instances
- Proper destructor frees NUMA Tier 0 buffers and storage
C++ (amx/moe.hpp):
- Add mmap zero-copy path in load_weights (std::free original buffers, point to mmap)
C++ (ext_bindings.cpp):
- Expose promote_expert/demote_expert/is_expert_promoted via pybind11 (LLAMA_MOE_TP only)
- Expose use_mmap field on GeneralMOEConfig
Python (weight_provider.py - new):
- TieredWeightProvider: orchestrates promote/demote via C++ MOE objects
- ExpertHotnessTracker: EMA-based activation frequency with vectorized np.add.at()
- MmapWeightRegion: zero-copy mmap views with madvise(MADV_WILLNEED) prefetch
- Background promotion thread with lazy start (only when first MOE registered)
Python (experts_base.py):
- Add _tiered_provider singleton and _prev_topk_ids class variables
- submit_forward(): prefetch using previous token's expert IDs (no GPU sync)
- sync_forward(): record activations after CPU sync (not in hot path)
- _register_moe_with_provider(): connect subclass MOE objects to tiered provider
Python (llamafile.py):
- load_weights() dual-mode: tiered (mmap zero-copy) vs legacy (memcpy)
- Register per-expert mmap regions with TieredWeightProvider for prefetch
- Register MOE with provider to enable promote/demote
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- CLI: add --weight-strategy (auto/tiered/legacy) to `kt run` command - CLI: pass --kt-weight-strategy to sglang engine launch - CLI: show weight strategy in interactive preview and tuna engine - Frontend: add Memory-Mapped Weights toggle in server config dialog - Frontend: pass --weight-strategy flag to kt process spawn - Add autodl setup script for DeepSeek V3.2 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three fixes for NUMA locality of mmap'd weight pages: 1. mbind in load_weights(): After setting per-expert mmap pointers, call mbind(MPOL_BIND | MPOL_MF_MOVE) to bind/migrate pages to the TP instance's NUMA node. Without this, pages land on whichever node first faults them (typically Python's main thread on node 0). 2. madvise in promote_expert(): Issue MADV_WILLNEED on baseline mmap regions before memcpy to trigger async readahead. Reduces synchronous page fault stalls (~100us each) during promotion. 3. NUMA-aware promote dispatch: TP_MOE::promote_expert() now spawns per-NUMA threads (via set_to_numa) so memcpy executes on the correct NUMA node. Previously ran on Python's unbound promotion thread, causing cross-NUMA traffic for both source reads and destination writes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add KT_TIER0_MEMORY_GB and KT_MAX_TIER0_EXPERTS env vars - Fallback mechanism: CLI params not passed to SGLang directly - BaseMoEWrapper reads env vars when parameters are None - Avoids modifying third-party SGLang code Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove --kt-weight-strategy, --kt-tier0-memory-gb, --kt-max-tier0-experts from SGLang command - SGLang doesn't recognize these parameters (would cause startup failure) - All three parameters now passed via environment variables only - Add KT_WEIGHT_STRATEGY env var support in BaseMoEWrapper Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use psutil to detect available system memory - Default: all available memory minus 4GB safety margin - Optimized for dedicated inference machines - User can override with --tier0-memory-gb or --max-tier0-experts Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix C++ compilation error in moe.hpp (typename decltype -> decltype) - Fix sync_forward signature for sglang integration compatibility
- Add tier0-memory-gb and max-tier0-experts CLI parameters - Integrate tier0 configuration across MoE kernel modules
…X/moe_kernel backends - Add promote_expert/demote_expert/is_expert_promoted hooks to AMX TP and moe_kernel C++ classes - Fix BF16 cache_capacity_=0 when max_tier0_experts<=0 (was incorrectly defaulting to 1) - Wire max_tier0_experts into AMX moe_config - Add mmap region registration for AMX and moe_kernel backends - Add cgroup v1/v2-aware memory detection in weight_provider.py - Add AMXINT4/AMXINT8/MOE_INT4/MOE_INT8 to _provider_backends in experts_base.py - Fix pre-commit hook shebang and mapfile for macOS bash 3.2 - All 11 unit tests passing
- Merged upstream changes including GPTQ expert loading support - Resolved conflict in loader.py: kept both our mmap methods and upstream GPTQSafeTensorLoader.load_experts()
Sprint 1: io_uring infrastructure - Add AsyncExpertReader class (cpu_backend/async_io.hpp/cpp) - Wrap io_uring for async expert weight loading from SSD - Support submit_read, wait_one_completion, poll_completions, wait_for_expert - CMakeLists.txt: detect liburing and link if available Sprint 2: AMX integration and cache stats - common.hpp: add IOBackend enum, ExpertFileSlot, ExpertCacheStats - amx/moe.hpp: add io_uring branch in allocate_and_copy_expert() - amx/moe.hpp: add cache hit/miss/promote/demote stats in ensure_expert_ready() - Conditional compilation with HAVE_LIBURING for Linux-only support Key benefits: - Eliminate OS page cache dependency (no double-occupancy) - Predictable I/O latency with completion notifications - Cache hit rate instrumentation for policy comparison Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add Python-side support for io_uring-based expert loading: 1. Python bindings (ext_bindings.cpp): - Expose AsyncExpertReader class to Python - Expose IOBackend enum (MMAP, IOURING) 2. Utility modules: - async_io_manager.py: Global AsyncExpertReader singleton - loader.py: load_experts_iouring() for file descriptor extraction 3. CLI integration (run.py): - Add --io-backend option (mmap/iouring) - Add --enable-cache-stats flag - Pass via KT_IO_BACKEND and KT_ENABLE_CACHE_STATS env vars 4. Configuration (experts_base.py): - Read io_backend and enable_cache_stats from environment - Store in wrapper config for C++ consumption 5. Unit tests (test_async_io.py): - Test AsyncExpertReader basic read - Test batch reading with multiple experts - Test timeout behavior - Test IOBackend enum Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…arameter Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Fix BLOCK_SIZE macro conflict with liburing by undefining after include - Fix AsyncExpertReader Python binding syntax (add missing semicolon) - Add completed_experts_ tracking to distinguish completed vs never-submitted experts - Update wait_for_expert() to return false for never-submitted experts (fixes timeout test) - Remove num_workers parameter from test calls (already removed from C++ constructor) - Remove shutdown() method calls from tests (method doesn't exist in C++) - Fix test import path to use kt_kernel_ext from build directory All async_io unit tests now pass (4/4 passed) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Code Review
This pull request introduces a new Electron-based frontend for KTransformers, providing a GUI for model management, server control, and system monitoring. It also includes significant backend enhancements, such as io_uring support for expert weight loading, NUMA-aware memory management, and a tiered weight residency strategy to optimize performance when model sizes exceed physical RAM. Several security and performance issues were identified in the review, including insecure Electron settings, command injection risks in IPC handlers, and inefficient resource management in the hot path of the inference engine.
| preload: join(__dirname, 'preload.js'), | ||
| contextIsolation: true, | ||
| nodeIntegration: false, | ||
| webSecurity: false // allow cross-origin requests to remote API servers |
Disabling webSecurity is a significant security risk in Electron applications as it bypasses the Same-Origin Policy. This allows the renderer process to make requests to any domain, which can be exploited if the application ever loads untrusted content. It is recommended to keep webSecurity enabled and handle cross-origin requirements via proper CORS configuration on the server or by using a proxy in the main process.
| if (model) args.push('--model', model) | ||
| args.push('--output', tmpFile) | ||
| return new Promise((resolve) => { | ||
| const proc = spawn('kt', args, { shell: true }) |
Using shell: true with spawn can lead to command injection vulnerabilities if any part of the arguments is user-controlled. Since args is already an array, shell: true is unnecessary and should be set to false (which is the default) to execute the command directly without a shell. This improvement should be applied to all spawn calls in this file and other service files where shell: true is used.
Suggested change:
- const proc = spawn('kt', args, { shell: true })
+ const proc = spawn('kt', args, { shell: false })
| auto accum_output_holder = alloc_aligned_f32(output_elems); | ||
| auto wave_output_holder = alloc_aligned_f32(output_elems); |
Performing heap allocations (posix_memalign or _aligned_malloc) inside the forward method for every prefill pass is a significant performance bottleneck and can lead to memory fragmentation. These buffers should be pre-allocated during initialization or managed through a reusable memory pool to ensure optimal performance in the hot path.
| void configure_experts(int layer, int n) { | ||
| if (n <= 0) return; | ||
| std::lock_guard<std::mutex> guard(expert_mu); | ||
| if (layer_idx == layer && expert_num == n && expert_access_count != nullptr) { | ||
| return; | ||
| } | ||
| layer_idx = layer; | ||
| expert_num = n; | ||
| dump_path.clear(); | ||
| if (const char* path = std::getenv("KT_EXPERT_STATS_PATH")) { | ||
| dump_path = path; | ||
| } | ||
| dump_every = 1; | ||
| if (const char* raw_every = std::getenv("KT_EXPERT_STATS_DUMP_EVERY")) { | ||
| char* end = nullptr; | ||
| const unsigned long long parsed = std::strtoull(raw_every, &end, 10); | ||
| if (end != raw_every && parsed > 0) { | ||
| dump_every = static_cast<uint64_t>(parsed); | ||
| } | ||
| } | ||
| expert_access_count = std::make_unique<std::atomic<uint64_t>[]>(n); | ||
| expert_hit_count = std::make_unique<std::atomic<uint64_t>[]>(n); | ||
| expert_miss_count = std::make_unique<std::atomic<uint64_t>[]>(n); | ||
| expert_cold_miss_count = std::make_unique<std::atomic<uint64_t>[]>(n); | ||
| expert_in_flight_miss_count = std::make_unique<std::atomic<uint64_t>[]>(n); | ||
| expert_promote_count = std::make_unique<std::atomic<uint64_t>[]>(n); | ||
| expert_prefetch_hit_count = std::make_unique<std::atomic<uint64_t>[]>(n); | ||
| for (int i = 0; i < n; ++i) { | ||
| expert_access_count[i].store(0, std::memory_order_relaxed); | ||
| expert_hit_count[i].store(0, std::memory_order_relaxed); | ||
| expert_miss_count[i].store(0, std::memory_order_relaxed); | ||
| expert_cold_miss_count[i].store(0, std::memory_order_relaxed); | ||
| expert_in_flight_miss_count[i].store(0, std::memory_order_relaxed); | ||
| expert_promote_count[i].store(0, std::memory_order_relaxed); | ||
| expert_prefetch_hit_count[i].store(0, std::memory_order_relaxed); | ||
| } | ||
| } | ||
|
|
||
| void note_expert_access(int expert_id) { | ||
| if (expert_id < 0 || expert_id >= expert_num || expert_access_count == nullptr) return; | ||
| expert_access_count[expert_id].fetch_add(1, std::memory_order_relaxed); |
There is a potential race condition in ExpertCacheStats. The configure_experts method reallocates the expert_access_count array while holding expert_mu, but the note_expert_access method (and others like note_expert_hit) accesses this array without any synchronization. If configure_experts is called while inference threads are active, it could lead to use-after-free or out-of-bounds access. Consider using a synchronization mechanism that protects the array pointer itself during access, or ensure that configuration only happens when no inference is running.
| return new Promise((resolve) => { | ||
| let output = '' | ||
| let errOutput = '' | ||
| const proc = spawn('python3', [scriptPath], { shell: false }) |
Hardcoding python3 may cause the application to fail on Windows systems where the executable is typically named python. Consider detecting the platform or using a configurable path for the Python interpreter.
Suggested change:
- const proc = spawn('python3', [scriptPath], { shell: false })
+ const proc = spawn(process.platform === 'win32' ? 'python' : 'python3', [scriptPath], { shell: false })
| } | ||
|
|
||
| const bool enable_wave_mode = []() { | ||
| const char* raw = std::getenv("KT_ENABLE_BF16_WAVE_RESIDENT"); |
| static float act_fn(float x) { return x / (1.0f + expf(-x)); } | ||
|
|
||
| static inline bool use_row_dot_debug_fallback() { | ||
| const char* env = std::getenv("KT_LLAMA_USE_ROW_DOT"); |
What does this PR do?
This PR introduces MESH, an experimental memory-tiered expert residency system for KTransformers MoE inference.
MESH is designed for heterogeneous CPU-GPU MoE serving when the full expert working set cannot stay comfortably resident in host DRAM. The current KT path works well when expert weights are available from normal memory-backed storage, but constrained-memory deployments expose a difficult systems problem: only a sparse subset of experts is active at runtime, yet the full expert set must remain accessible. Relying only on `mmap` leaves expert residency mostly under OS page-cache control.
MESH adds an explicit runtime-managed expert residency layer for AMXINT4 MoE inference. It can load expert weights from NVMe through `io_uring` + `O_DIRECT` into NUMA-local CPU buffers, manage a bounded resident expert slot pool, and preserve the existing KT AMX compute path once the required experts are resident.
The goal is not to replace KT's AMX kernels. The goal is to make expert residency explicit: which experts are in CPU memory, which experts are cold, when cold experts are read, and how prefill/decode should share the resident slot pool.
Why this is useful
On constrained-memory machines, `mmap` makes MoE expert residency hard to control. MESH gives KTransformers a runtime-level mechanism to manage this explicitly, including expert reads through `io_uring`.
This is useful for local single-node MoE serving, workstation deployments, and memory-constrained CPU-GPU systems where CPU AMX, GPU attention, NUMA locality, and NVMe-backed expert storage need to work together.
Main components
1. `io_uring` direct-I/O expert loading
MESH adds an `io_uring`-based loading path for expert weights. With `O_DIRECT`, expert tensors can be read from storage directly into application-owned buffers, avoiding the OS page cache on the MESH path.
The async reader tracks request completion explicitly and validates read results. It also includes retry/validation logic for failed or incomplete reads.
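`O_DIRECT` reads generally require the file offset and length (and the buffer address) to be aligned to the device's logical block size, so a tensor that starts mid-block has to be read as a slightly larger aligned range. A minimal sketch of that rounding, assuming a 4096-byte block size; `aligned_span` and `Span` are hypothetical names for illustration and are not taken from the PR's C++ reader:

```python
from typing import NamedTuple

BLOCK = 4096  # assumed logical block size; real devices may report 512 or 4096


class Span(NamedTuple):
    offset: int  # aligned file offset to read from
    length: int  # aligned number of bytes to read
    skip: int    # bytes to skip at the start of the buffer to reach the tensor


def aligned_span(offset: int, length: int, block: int = BLOCK) -> Span:
    """Round a byte range outward so both ends fall on block boundaries."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    start = (offset // block) * block
    end = -(-(offset + length) // block) * block  # ceiling division
    return Span(start, end - start, offset - start)


# A 5000-byte tensor at file offset 10000 becomes an 8192-byte aligned read.
span = aligned_span(offset=10_000, length=5_000)
print(span)
```

The caller would then use `buffer[span.skip : span.skip + 5000]` as the tensor payload.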
2. NUMA-local resident expert slots
MESH adds a resident slot pool for CPU-managed experts. Each resident expert is associated with slot-owned buffers for its expert tensors. The slot metadata tracks expert state, slot state, and active readers so that a resident expert is not evicted while it is being used by the AMX forward path.
The slot pool is virtual: a slot is not permanently tied to a fixed expert id. It can be rebound to different experts as promotion and eviction happen.
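The slot behavior described above (rebindable slots, with active readers blocking eviction) can be sketched as follows. This is a single-threaded illustration under assumed names (`Slot`, `SlotPool`, `bind`, `acquire`, `release`); the PR's C++ slot metadata and its synchronization are not shown in this page and may differ:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Slot:
    expert_id: Optional[int] = None  # None means the slot is free
    readers: int = 0                 # forward passes currently using the slot


class SlotPool:
    """Fixed number of slots; any slot can be rebound to any expert."""

    def __init__(self, capacity: int) -> None:
        self.slots: List[Slot] = [Slot() for _ in range(capacity)]
        self.by_expert: Dict[int, int] = {}  # expert id -> slot index

    def acquire(self, expert_id: int) -> Optional[int]:
        """Pin a resident expert for reading; None if it is not resident."""
        idx = self.by_expert.get(expert_id)
        if idx is not None:
            self.slots[idx].readers += 1
        return idx

    def release(self, expert_id: int) -> None:
        """Drop one reader pin taken by acquire()."""
        self.slots[self.by_expert[expert_id]].readers -= 1

    def bind(self, expert_id: int) -> Optional[int]:
        """Bind an expert to a slot; None if every slot is pinned."""
        if expert_id in self.by_expert:
            return self.by_expert[expert_id]
        free = [i for i, s in enumerate(self.slots) if s.expert_id is None]
        idle = [i for i, s in enumerate(self.slots)
                if s.expert_id is not None and s.readers == 0]
        candidates = free or idle  # prefer free slots, then unpinned ones
        if not candidates:
            return None
        idx = candidates[0]
        old = self.slots[idx].expert_id
        if old is not None:
            del self.by_expert[old]  # evict the previous occupant
        self.slots[idx].expert_id = expert_id
        self.by_expert[expert_id] = idx
        return idx


pool = SlotPool(capacity=2)
pool.bind(7)                # takes slot 0
pool.bind(9)                # takes slot 1
pool.acquire(7)             # expert 7 is now in use
slot_for_11 = pool.bind(11)  # must reuse slot 1, because slot 0 is pinned
```

The victim choice here is simply "first unpinned slot"; a real policy would plug in at the `candidates` step.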
3. Batched cold expert promotion
For a layer forward, MESH can collect the CPU experts needed by that forward pass, identify which ones are already resident, submit reads for the cold ones, wait for those reads to complete, bind the completed buffers into slots, and then call the existing KT AMX `Base::forward()` path.
This keeps the compute path close to KT's original implementation while moving expert residency decisions into an explicit runtime layer.
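The collect / submit / wait / bind / forward sequence can be illustrated with a small sketch. A thread pool stands in for `io_uring` here, and `forward_with_promotion`, `read_expert`, and `forward` are hypothetical names; the real path operates on slot buffers and calls the AMX kernels:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def forward_with_promotion(
    needed: Iterable[int],
    resident: Dict[int, bytes],
    read_expert: Callable[[int], bytes],
    forward: Callable[[List[bytes]], int],
) -> int:
    """Make every needed expert resident, then run the unchanged forward."""
    needed = list(dict.fromkeys(needed))            # de-duplicate, keep order
    cold = [e for e in needed if e not in resident]  # experts that must be read
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Submit all cold reads first so they can overlap.
        futures = {e: pool.submit(read_expert, e) for e in cold}
        for e, fut in futures.items():
            resident[e] = fut.result()               # wait, then bind
    return forward([resident[e] for e in needed])


reads: List[int] = []


def fake_read(expert_id: int) -> bytes:
    reads.append(expert_id)  # record which experts were actually read
    return bytes([expert_id])


resident_set = {0: b"\x00"}
result = forward_with_promotion(
    [0, 2, 2, 5], resident_set, fake_read, lambda ws: sum(len(w) for w in ws)
)
```

Only experts 2 and 5 are read; expert 0 is served from the resident set without I/O.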
4. Cache policy and heat-aware residency
MESH keeps a bounded resident set and uses a policy-driven eviction path. The current implementation supports a SIEVE-style base policy and a heat/lookahead signal derived from router scores. The intent is to retain experts that are likely to be reused while allowing cold or low-value experts to be demoted.
The implementation also supports full-gate score observation and skips unnecessary observation when the effective resident capacity already covers all CPU-managed experts.
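For readers unfamiliar with SIEVE, the base policy can be sketched in a few lines: each entry carries a visited bit, a hit only sets that bit, and on eviction a hand sweeps from the oldest entry toward the newest, clearing visited bits until it finds an unvisited victim. This is a generic sketch of plain SIEVE, not the PR's implementation; the heat/lookahead signal from router scores is deliberately omitted, and `SieveCache` is a hypothetical name:

```python
from typing import Dict, List


class SieveCache:
    """Plain SIEVE over expert ids (no heat/lookahead signal)."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.queue: List[int] = []          # index 0 is oldest, end is newest
        self.visited: Dict[int, bool] = {}  # expert id -> visited bit
        self.hand = 0                       # scan position, oldest to newest

    def access(self, key: int) -> bool:
        """Return True on a hit; on a miss, insert key (evicting if full)."""
        if key in self.visited:
            self.visited[key] = True        # a hit only sets the bit
            return True
        if len(self.queue) >= self.capacity:
            self._evict()
        self.queue.append(key)
        self.visited[key] = False
        return False

    def _evict(self) -> int:
        # Visited entries get a second chance: clear the bit and move on.
        while self.visited[self.queue[self.hand]]:
            self.visited[self.queue[self.hand]] = False
            self.hand = (self.hand + 1) % len(self.queue)
        victim = self.queue.pop(self.hand)
        del self.visited[victim]
        if self.hand >= len(self.queue):    # passed the newest entry: wrap
            self.hand = 0
        return victim


cache = SieveCache(capacity=3)
hits = [cache.access(e) for e in [1, 2, 3, 1, 4]]
```

In this trace expert 1 survives the eviction triggered by expert 4 because it was re-accessed, and expert 2 is evicted instead.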
5. Deferred-expert aware decode behavior
MESH integrates with KT's deferred expert execution. Decode can distinguish resident experts from cold experts and issue prefetches for cold deferred experts. The accounting separates normal hits, cold misses, and in-flight misses so that async prefetch behavior is not mistaken for a fully cold miss.
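The three-way accounting can be sketched as a small classifier. The category names mirror the hit, cold-miss, and in-flight-miss counters in `ExpertCacheStats`; the function and enum themselves are hypothetical and only illustrate the distinction:

```python
from collections import Counter
from enum import Enum
from typing import Set


class Lookup(Enum):
    HIT = "hit"                        # expert already resident
    IN_FLIGHT_MISS = "in_flight_miss"  # not resident, but a read is pending
    COLD_MISS = "cold_miss"            # not resident and no read issued yet


def classify(expert_id: int, resident: Set[int], in_flight: Set[int]) -> Lookup:
    """Classify one expert lookup against the resident and in-flight sets."""
    if expert_id in resident:
        return Lookup.HIT
    if expert_id in in_flight:
        return Lookup.IN_FLIGHT_MISS
    return Lookup.COLD_MISS


counts = Counter(classify(e, {1, 2}, {3}) for e in [1, 3, 4, 2, 3])
```

Keeping in-flight misses separate shows how often a prefetch was issued in time to be pending but not in time to complete.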
6. Prefill residency support
MESH includes prefill-specific residency support. The current branch contains a prefill layer-window mode and transition logic back into decode hot-cache mode. This allows prefill and decode to interpret the resident slot pool differently while still sharing the same underlying slot abstraction.
The implementation also supports configurable early-layer residency through `KT_MESH_EARLY_LAYER_EXPERTS`, because early MoE layers can have different miss behavior from deeper layers.
7. GPU expert compatibility
MESH does not treat GPU experts as CPU resident-slot candidates. CPU-side residency logic checks the actual per-layer GPU expert mask and skips experts that are already assigned to GPU execution.
This matters because GPU experts are not necessarily a fixed prefix of expert ids and may vary by layer or under dynamic expert placement.
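A minimal sketch of the mask check, assuming the per-layer GPU placement is available as a boolean list indexed by expert id (the actual representation in the PR is not shown here, and `cpu_candidates` is a hypothetical name):

```python
from typing import Dict, List, Sequence


def cpu_candidates(routed: Sequence[int], gpu_mask: Sequence[bool]) -> List[int]:
    """Keep only routed experts that are not assigned to the GPU in this layer."""
    return [e for e in routed if not gpu_mask[e]]


# GPU experts differ per layer and are not a fixed prefix of expert ids.
masks: Dict[int, List[bool]] = {
    0: [True, False, True, False],
    1: [False, False, False, True],
}
layer0 = cpu_candidates([0, 1, 2, 3], masks[0])
layer1 = cpu_candidates([0, 1, 2, 3], masks[1])
```

A prefix assumption such as "experts below N are on GPU" would give the wrong answer for both layers in this example.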
Compatibility
MESH is intended to be opt-in through the IOURING backend. The existing mmap/default path should continue to behave as before.
The implementation is designed to preserve KT's existing execution assumptions:
Current validation status
This work has been validated at two levels.
Automated/unit-level tests
The branch includes unit-level coverage for the async I/O layer, including:
The short-read test verifies that a completed-but-incomplete read is not treated as a successful request.
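The short-read rule can be sketched with plain `os.pread` (POSIX only) rather than `io_uring`: a read that returns fewer bytes than requested is retried for the remainder, and a request that still falls short is reported as a failure. `read_exact` and its retry limit are assumptions for illustration, not the PR's API:

```python
import os
import tempfile


def read_exact(fd: int, offset: int, length: int, max_retries: int = 3) -> bytes:
    """Read exactly `length` bytes at `offset`, or raise OSError."""
    chunks = []
    got = 0
    retries = 0
    while got < length:
        data = os.pread(fd, length - got, offset + got)
        if not data:              # end of file: no further progress possible
            break
        chunks.append(data)
        got += len(data)
        if got < length:          # short read: re-request the remainder
            retries += 1
            if retries > max_retries:
                break
    if got != length:
        raise OSError(f"short read: wanted {length} bytes, got {got}")
    return b"".join(chunks)


with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(bytes(100))         # a 100-byte file
    tmp.flush()
    fd = os.open(tmp.name, os.O_RDONLY)
    try:
        full = read_exact(fd, 0, 100)
        try:
            read_exact(fd, 50, 100)  # only 50 bytes exist past offset 50
            short_read_rejected = False
        except OSError:
            short_read_rejected = True
    finally:
        os.close(fd)
```

The second call completes at the system-call level but returns too few bytes, which is exactly the case that must not count as success.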
Manual/system-level validation
MESH has also been tested manually on a dual-socket AMX server with Qwen3.5-35B AMXINT4 weights, NVMe-backed expert storage, NUMA execution, and SGLang/KTransformers serving.
The manual validation included:
- `io_uring` direct-read behavior under AMXINT4 expert loading.
- … `iostat`.
Some of these experiments are saved under local paper/experiment artifact directories and are not suitable as normal CI tests.
The full MESH prefill/decode system test is not currently included as an automated per-commit test because it is hardware- and model-dependent: it requires AMX-capable CPUs, the AMXINT4 expert weight layout, NVMe-backed weights, NUMA configuration, and a running SGLang/KTransformers server.
This PR should be reviewed as an experimental systems path rather than a fully production-hardened default backend.
Fixes # (issue)
Before submitting
Note: automated tests exist for the async I/O layer. MESH has also been manually validated on the target AMX/NVMe/model-weight infrastructure, but those full system tests are not included in per-commit CI.