[doc] add flexkv_config.json introduce #15

Merged

peaceforeverCN merged 1 commit into bugfix from args_docs on Sep 15, 2025

Conversation

@peaceforeverCN
Collaborator

No description provided.

@peaceforeverCN peaceforeverCN merged commit 36fcf1c into bugfix Sep 15, 2025
1 check passed
@peaceforeverCN peaceforeverCN deleted the args_docs branch September 15, 2025 13:13
wenpengw-nv pushed a commit to wenpengw-nv/FlexKV that referenced this pull request Oct 20, 2025
Co-authored-by: zuogan <zuogan@tencent.com>
linhu-nv added a commit that referenced this pull request Nov 25, 2025
* fix logic problems in client

* quick fix to client-server example

* fix some bugs to run compile

* modify blockmeta definition & impl mempool

* impl index

* add index benchmark

* faster hash

* add test for storage_engine + transfer_engine

* optim index

* fix a few bugs about performance, should have normal perf now

* add mempool benchmark

* optim mempool

* optimize index.insert

* fix mempool

* eviction implementation and optimization

* fix evictor

* flatten free ids tensor

* impl get/put pipeline

* add reset for cache engine

* global id allocator

* list to tensor

* init kvmanager

* run the pipeline

* print cpu-gpu transfer info

* refactor index

* refactor kvmanager and cache engine

* add insert_length for insert

* cleanup buffer

* Layer wise

* Added the xxhash

* add cmake

* adapt use xxhash

* fix bug

* refactor

* fix

* remove comment

* add benchmark and refactor code

* fix bug

* radix tree

* update

* update

* fix

* quick fix to ssd file allocation

* remove batch transfer

* quick fix to transfer_op initialization

* check tensor type

* support ce transfer

* modify graph creation

* remove dataclass

* fix to illegal memory access

* fix to transfer_op default values

* remove gitlab-ci yml

* submodule install automatically

* init README

* add cython

* use multiple processes for transfer workers to bypass GIL
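A hedged sketch of the idea in the commit above: run transfer workers as separate processes so CPU-bound copy work is not serialized by the GIL. All names are illustrative, not FlexKV's actual API; the "fork" start method assumes a Unix host.

```python
# Illustrative sketch: each transfer worker is its own process, with its own
# interpreter and therefore its own GIL. Names are hypothetical.
import multiprocessing as mp

_ctx = mp.get_context("fork")  # assumes Unix; avoids spawn re-import issues

def _transfer_worker(task_q, result_q):
    while True:
        task = task_q.get()
        if task is None:          # poison pill: shut the worker down
            break
        src_block, dst_block = task
        result_q.put((src_block, dst_block))  # stand-in for the real block copy

def run_transfers(tasks, num_workers=2):
    task_q, result_q = _ctx.Queue(), _ctx.Queue()
    workers = [_ctx.Process(target=_transfer_worker, args=(task_q, result_q))
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    for t in tasks:
        task_q.put(t)
    for _ in workers:
        task_q.put(None)
    done = [result_q.get() for _ in tasks]
    for w in workers:
        w.join()
    return done
```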

* fix multi queue import

* quick fix

* support different layout && multiple ssd (#3)

* add debug mode

* default to debug

* default to debug

* NVTX Integration && I/O Optimization (#6)

* fix

* add nvtx

* fix from_numpy

* include remote cache engine, use ssd as fake remote file for test

* some case works now

* fix

* support parallel remote read/write

* add nvtx info

* cpu-only

* reduce polling sleep time

* remove unused link

* Support for multi-SSD read/write and round-robin layout (#8)

* support multi-ssd

* clear gpu blocks after put

* fix permission denied

* split cpp

* update

* share tensor (#9)

* test tensor sharing by zmq (#10)

* fix return

* test zmq

* Implement Server Client Mode (#11)

Currently does not support DP

* merge leolingli/dev-2 to dev (#13)

* add remote cache and pcfs
this version only implements pcfs and reserves the following optimizations:
    1. pcfs should be one class of remote cache, not the only solution
    2. pcfs args need to be configurable
    3. pcfs needs to support multiple files

* add remote cache multi-file support
add custom args for remote cache and add pcfs args
change Chinese comments to English
move pcfs.c & .h to pcfs dir
add test_remote_kvmanager for remote test
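The optimization list above calls for pcfs to become one implementation of a generic remote-cache interface with multi-file support. A minimal sketch of that abstraction, with all class and method names assumed for illustration (a PCFS backend would be another subclass):

```python
# Hypothetical backend abstraction: pcfs as one RemoteCacheBackend among
# several, with multi-file addressing. Names are illustrative, not FlexKV's.
from abc import ABC, abstractmethod

class RemoteCacheBackend(ABC):
    @abstractmethod
    def write_block(self, file_id: int, offset: int, data: bytes) -> None: ...
    @abstractmethod
    def read_block(self, file_id: int, offset: int, size: int) -> bytes: ...

class InMemoryBackend(RemoteCacheBackend):
    """Stand-in backend (e.g. for tests); a pcfs backend would subclass too."""
    def __init__(self, num_files: int):
        # multi-file support: one buffer per remote file
        self.files = {i: bytearray() for i in range(num_files)}

    def write_block(self, file_id, offset, data):
        buf = self.files[file_id]
        if len(buf) < offset + len(data):
            buf.extend(b"\x00" * (offset + len(data) - len(buf)))
        buf[offset:offset + len(data)] = data

    def read_block(self, file_id, offset, size):
        return bytes(self.files[file_id][offset:offset + size])
```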

---------

Co-authored-by: leolingli <leolingli@tencent.com>

* add FLEXKV_ENABLE_CFS env var to control build/link flags (#14)

* some optimizations about transfer graph and task status trackers

* rebased on the latest dev branch, draft

* tp enabled, draft, committed for multi-gpu test

* tp&dp support works now on test_kvmanager

* quick fix of multi_process test; the return masks are ok now, results not accurately verified

* some small fixes, deal with empty graphs

* debug for tp and dp (#15)

Co-authored-by: zuogan <zuogan@tencent.com>

* fix corner case && add pytest (#16)

* support split used radix-node

* add pytest

* add exceptions && unit-tests

* Standardize imports & add .so to system path to avoid exporting manually

* refactor transferWorkers

* quick fix

* add dtype into the modelConfig so that we can support different dtypes

* use ordered_dict to implement an expiring dict
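A minimal sketch of an expiring dict built on `OrderedDict`, in the spirit of the commit above; the real FlexKV class may differ. Insertion order lets eviction stop at the first unexpired entry.

```python
# Illustrative expiring dict: keys older than ttl are dropped lazily on
# access. Timestamps are injectable to keep the example deterministic.
import time
from collections import OrderedDict

class ExpiringDict:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._data = OrderedDict()  # key -> (value, insert_time), oldest first

    def _evict_expired(self, now: float) -> None:
        while self._data:
            _, (_, ts) = next(iter(self._data.items()))
            if now - ts < self.ttl:
                break  # everything after this entry is newer
            self._data.popitem(last=False)

    def put(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        self._evict_expired(now)
        self._data.pop(key, None)   # re-insert moves key to the newest slot
        self._data[key] = (value, now)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        self._evict_expired(now)
        item = self._data.get(key)
        return None if item is None else item[0]
```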

* quick fix

* quick fix to build.sh

* refactor get/put (#19)

* refactor get/put

* deal with no_space error

* local after get/put impl

* test kvmanager in pytest

* fix test

* Support (async) query interface.  (#20)

* Support (async) query interface

* fix import

* replace 'try_wait_at_layer_group' with 'query_at_layer_group'

* rename

---------

Co-authored-by: 869974612@qq.com <scutizhang@tencent.com>

* in the return mask of the put op, only set tokens that are really transferred to True

* use GPUCPUTransferWorker when the tp size is only 1 & fix a bug of test script

* add pytest unit for transfer_engine

* add mypy (#22)

* mypy update

* refactor of transfer worker

* refactor and fix mypy error

* fix

* update pyproject.toml

* fix tests

* fix insert return

* rename logger

---------

Co-authored-by: linhu-nv <linhu@nvidia.com>

* refactor tensor handle (#25)

* small fixes

* now we accept gpu_kv_layout as a parameter, and automatically infer other layouts in the storage engine

* fix of kvlayout generation, manually launch transfer_engine

* quick fix

* add mla support

* refactor graph generation (#28)

* change remote cache: (#30)

  add mla support for remote cache
  use the truncate function instead of write at init to change the file size
  add remote file_size config, so the file size can be configured in remote cache
  init file in storage engine
  add sequential mode (compared with round_robin) and use it in remote cache

todo:
  pytest does not support pcfs (pcfs init's thread cannot be found in pytest), use python instead
  remote cache async read/write
  is_mla test

Co-authored-by: leolingli <leolingli@tencent.com>
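The remote-cache commit above mentions two placement modes and preallocating files with truncate. A hedged sketch of both, with names assumed for illustration: round-robin spreads consecutive blocks across files, while sequential fills one file before moving to the next, and `truncate()` grows a file to its configured size without writing data.

```python
# Illustrative block placement and file preallocation; function names are
# assumptions, not FlexKV's actual API.
import os

def block_location(block_id: int, num_files: int, blocks_per_file: int,
                   mode: str = "round_robin"):
    """Return (file_index, slot_within_file) for a logical block id."""
    if mode == "round_robin":
        # consecutive blocks land on different files (spreads I/O)
        return block_id % num_files, block_id // num_files
    if mode == "sequential":
        # fill one file completely before starting the next
        return block_id // blocks_per_file, block_id % blocks_per_file
    raise ValueError(f"unknown mode: {mode}")

def init_cache_file(path: str, file_size: int) -> None:
    # truncate() extends the file to its target size without writing data,
    # which is why it is preferred over an explicit write at init time.
    with open(path, "wb") as f:
        f.truncate(file_size)
```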

* quick fix for mla + tp

* support both layer-wise and block-wise storage for cpu-blocks (#29)

* support both layer-wise and block-wise storage for cpu-blocks

* optimize multi-ssd rw

* support ssd-cpu blockwise transfer

* refactor kvlayout

* fix config

* fix bugs of mla+tp and pytest

---------

Co-authored-by: zhuofanl <zhuofanl@nvidia.com>
Co-authored-by: root <root@H20-GPU-10.cm.cluster>

* quick fix for len(token_ids) < tokens_per_block and number of True entries in mask < tokens_per_block (#33)

* skip unready block in put (#34)

* io_uring support for ssd cache (#32)

* io_uring support for ssd cache

Add io_uring support for ssd cache to accelerate io performance.

Signed-off-by: charliecgxu <charliecgxu@tencent.com>

* tests: adding test cases for iouring

adding cases for iouring function/performance testing.

Signed-off-by: charliecgxu <charliecgxu@tencent.com>

* tests: adding test cases to test_kvmanager.py

adding iouring test cases to test_kvmanager.py as well.

Signed-off-by: charliecgxu <charliecgxu@tencent.com>

---------

Signed-off-by: charliecgxu <charliecgxu@tencent.com>

* add tracer that can record all requests of flexKV, and the replay script (#36)

* fix to radixtree (#38)

* repair kvmanager verify logic to fit no remote cache situation (#39)

change cpu_tensor_ptr to run remote cache

* quick fix for radix tree (#40)

* add header file dependency (#41)

Signed-off-by: charliecgxu <charliecgxu@tencent.com>

* fix precommit format issues (#42)

* change try_wait and only return the finished request (#43)

* use numpy instead of tensor for zmq communication (#45)

* use fadvise correctly (#44)

Signed-off-by: charliecgxu <charliecgxu@tencent.com>

* task id from client for faster get/put_async (#47)

* fix some improper variable names (#48)

Signed-off-by: charliecgxu <charliecgxu@tencent.com>

* add config (#46)

* more config for gpu transfer

* configure max blocks per file

* spread IO to as many files as possible (#49)

Signed-off-by: charliecgxu <charliecgxu@tencent.com>

* refactor pytest: add test_utils; add server-client mode

* print input parameters correctly in the error case of iouring

Signed-off-by: charliecgxu <charliecgxu@tencent.com>

* remove the restriction of pin memory for iouring

Currently, we do not register I/O buffers for io_uring,
so there is no pinned-memory restriction.

Signed-off-by: charliecgxu <charliecgxu@tencent.com>

* add unit benchmark for workers (#52)

* worker unit benchmark

* fix

* format adjusted; delete two test scripts

* remove direct flag for ssd write

The write performance of SSD is much worse than the read performance,
so remove the O_DIRECT flag for write operations.

Signed-off-by: charliecgxu <charliecgxu@tencent.com>
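The commit above keeps direct I/O on the read path only. A hedged sketch of that flag selection (the function name is illustrative; the real code is C/C++):

```python
# Illustrative open-flag selection: O_DIRECT bypasses the page cache, which
# helps reads here but was measured to hurt writes, so writes stay buffered.
import os

def ssd_open_flags(for_read: bool, use_direct_io: bool) -> int:
    if for_read:
        flags = os.O_RDONLY
        if use_direct_io and hasattr(os, "O_DIRECT"):
            flags |= os.O_DIRECT  # direct I/O only when reading
        return flags
    # writes go through the page cache: no O_DIRECT
    return os.O_WRONLY | os.O_CREAT
```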

* refactor benchmark_cache_engine and fix some issues (#55)

* update

* fix tests

* rename utils.py

* fix bug in benchmark

* refactor random request generation

* fix rebase error

* avoid opening ssd files per io request

Signed-off-by: charliecgxu <charliecgxu@tencent.com>

* fix bug, ssd_layer_stride_in_bytes compute error

* add e2e benchmark for kvmanager (#58)

* fix default params

* add server_scheduler to reduce inter-process communication overhead (#59)

* Unify the interface of the flexkv server (#60)

* select direct_io fds only in read && direct mode

Signed-off-by: charliecgxu <charliecgxu@tencent.com>

* integrate expiring_dict

* use Pipe instead of Queue for comm in transferEngine and worker

* limit the id range

* fallback to preadv/pwritev when iouring inflight request over limit (#64)

Signed-off-by: charliecgxu <charliecgxu@tencent.com>
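The fallback policy above can be sketched as a submitter that counts in-flight io_uring requests and serves overflow synchronously (preadv/pwritev in the real code). The class and the injected callables are illustrative stand-ins:

```python
# Illustrative sketch of "fall back to a sync path when io_uring in-flight
# requests exceed the limit". The two callables stand in for the real
# io_uring submission and preadv/pwritev syscalls.
class IoSubmitter:
    def __init__(self, max_inflight: int, uring_submit, sync_io):
        self.max_inflight = max_inflight
        self.inflight = 0
        self.uring_submit = uring_submit  # async path
        self.sync_io = sync_io            # synchronous fallback

    def submit(self, request) -> str:
        if self.inflight < self.max_inflight:
            self.inflight += 1
            self.uring_submit(request)
            return "uring"
        # over the cap: serve synchronously instead of queueing unboundedly
        self.sync_io(request)
        return "sync"

    def complete(self) -> None:
        self.inflight -= 1  # called when an async request finishes
```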

* reduce bubble between op launch (#65)

* some small fixes (#66)

* remove worker_init_timeout_minutes (#67)

* quick fix that the is_mla is not given to tp_client

* add a message when error (#71)

1. fix bug: need to return torch.empty(0, dtype=bool); returning a float causes an int + float problem in vllm
2. delete info logging in the client for performance
3. add is_ready for the client to determine whether flexkv is ready

* radix tree c++ impl (#70)

* radix tree implementation in c++

Signed-off-by: charliecgxu <charliecgxu@tencent.com>

* support new radix-tree in cache engine

Signed-off-by: charliecgxu <charliecgxu@tencent.com>

---------

Signed-off-by: charliecgxu <charliecgxu@tencent.com>

* sync kernel launch

* kvmanager refactor (#73)

* add KVCacheEngineClient APIs

* basic implementation for KVCacheEngineClient

* initial transfer manager

* init transfer handle

* init kv engine

* refactor kvmanager

* update kvmanager

* some refactor

* kv response

* add benchmark

* serialize graph

* fix bugs

* ready check

* update

* rename

* rename benchmark

* use numpy instead of tensor

* small fix

* remove transfer descriptor

* rename to kvmanager

* update api

* add gpu-kvcache-verifier, draft

* update

* create a new tp-worker process and create gpu blocks for verification

* rename

* the test_kvmanager works now

* fix virtual op initialize

* fix verifier bug when tp > 1 and mla enabled

* fix

* remove task id && some fix

* only create one h2d op

* pass slotmapping for launch

* quick fix

---------

Co-authored-by: linhu-nv <linhu@nvidia.com>
Co-authored-by: Fei Liang <hanyueh@nvidia.com>

* feat: add support release wheel (#77)

* feat: add support release wheel

Signed-off-by: lilgao <lilgao@tencent.com>

* fix copilot review for ci

Signed-off-by: lilgao <lilgao@tencent.com>

---------

Signed-off-by: lilgao <lilgao@tencent.com>
Co-authored-by: lilgao <lilgao@tencent.com>

* add evict_ratio in cache config, default is 0

  evict number is max(int(mempool.num_total_blocks * evict_ratio), former evict number)
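The eviction rule stated above, written out as a helper (the function name is illustrative): evict at least an `evict_ratio` fraction of the pool, but never fewer blocks than the caller already needed freed.

```python
# The max(...) rule from the commit message above; name is hypothetical.
def num_blocks_to_evict(num_total_blocks: int, evict_ratio: float,
                        required_blocks: int) -> int:
    # at least evict_ratio of the pool, but never less than what is needed now
    return max(int(num_total_blocks * evict_ratio), required_blocks)
```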

* update unit tests for new version (#79)

* update test cache engine

* update test cache engine accel

* remove some tests

* rename functions

* enable profile in release build

Signed-off-by: charliecgxu <charliecgxu@tencent.com>

* update benchmark worker (#82)

* update benchmark worker

* status map

* enable nvtx

* fix default config

* clear cpu to test ssd cache

* ci: trigger on main and dev

Signed-off-by: lilgao <lilgao@tencent.com>

* fix broken cpp radix tree support for cache engine (#84)

* adjust index accel to new cache engine data struct

Signed-off-by: charliecgxu <charliecgxu@tencent.com>

* fix broken tests for cache engine

Signed-off-by: charliecgxu <charliecgxu@tencent.com>

---------

Signed-off-by: charliecgxu <charliecgxu@tencent.com>

* fix direct io

* use a ring buffer in the transfer engine to manage the src and dst block ids, instead of calling the pin-memory function inside the launch kernel, to reduce the bubble
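A minimal sketch of a fixed-capacity ring buffer for (src, dst) block-id pairs, in the spirit of the commit above; FlexKV's actual ring is backed by shared/pinned memory rather than a Python list.

```python
# Illustrative fixed-capacity ring buffer; producers push (src, dst) id
# pairs, the transfer engine pops them. Names are not FlexKV's actual API.
class RingBuffer:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.buf = [None] * capacity
        self.head = 0  # next slot to read
        self.tail = 0  # next slot to write
        self.size = 0

    def push(self, item) -> bool:
        if self.size == self.capacity:
            return False  # full: caller must retry or fall back
        self.buf[self.tail] = item
        self.tail = (self.tail + 1) % self.capacity
        self.size += 1
        return True

    def pop(self):
        if self.size == 0:
            return None
        item = self.buf[self.head]
        self.head = (self.head + 1) % self.capacity
        self.size -= 1
        return item
```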

* quickfix for return type of reduce_tensor

* fix bug

* refine ring_buffer and apply it to all workers

* rename PinnedMemoryRing to SharedMemoryRing

* fix status bug

* allow to exceed the max_block_num

* refactor: use hash to allocate buffer && no wait for free slot

* add gds

* fix batch sync

* add gds transfer worker support

* op-level callback

* fix bugs

* support dp > 1 while integrated with vllm

* avoid redundant d2h data transfer for mla in tp

* add gds worker & test

* gdsput changed to original ssdtransfer

* fix callback bug

* add gds docs

* remove redundant code

* Tp16 support (#26)

* initial tp16 support

* avoid global mp context setting

* model_config for transfer

* configured by master node

* build flag & assert

* refactor gds transfer thread

* [feature] Different gtensor layouts (#27)

* support both vllm gpu tensor layouts and trtllm gpu tensor layouts

* fix some small bugs

* use template specialization to support different gpu tensor layouts

* skip gds tests when not intended

* add test for sglang

---------

Co-authored-by: Fei Liang <feliang@nvidia.com>
Co-authored-by: root <root@H20-GPU-10.cm.cluster>
Co-authored-by: annz <annz@nvidia.com>

* quick fix to benchmark_worker (#31)

Co-authored-by: annz <annz@nvidia.com>

* [feature] optimize SSD I/O (#33)

* blockfirst ssd io

* set io_uring flag=1

* batch sync for iouring

* bench bidirection transfer

* swap loop to improve multi-SSD bandwidth

* prefer read

* [Feature] Implement grace time with hit reward for cache node

This commit introduces a grace time mechanism with hit reward seconds
for cache nodes in the radix tree index. Key changes include:

- Replace last_access_time with grace_time in RadixNode
- Add hit_reward_seconds parameter to CacheEngine and RadixTreeIndex
- Update node time calculation: on cache hit, extend grace time by
  hit_reward_seconds instead of resetting to current time
- Add hit_reward_seconds configuration option in CacheConfig

The new mechanism helps prioritize frequently accessed nodes in cache
eviction by rewarding hits with extended grace periods.

Signed-off-by: charliecgxu <charliecgxu@tencent.com>
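The grace-time mechanism described above can be sketched as follows. The field names follow the commit text; the surrounding class and the exact extension rule (extending from the later of the current grace time and "now") are one plausible reading, not the confirmed implementation:

```python
# Illustrative sketch: on a cache hit, extend grace_time by
# hit_reward_seconds instead of resetting it to the current time, so
# frequently hit nodes accumulate protection against eviction.
class RadixNodeSketch:
    def __init__(self, now: float, hit_reward_seconds: float):
        self.hit_reward_seconds = hit_reward_seconds
        self.grace_time = now  # replaces last_access_time

    def on_hit(self, now: float) -> None:
        # assumption: reward stacks on whatever grace is still remaining
        self.grace_time = max(self.grace_time, now) + self.hit_reward_seconds

    def evictable(self, now: float) -> bool:
        return now > self.grace_time
```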

* [Feature] Extend grace time with hit reward to accelerated index

This commit extends the grace time with hit reward mechanism to the
accelerated C++ radix tree implementation. Key changes include:

- Add hit_reward_seconds parameter to CRadixTreeIndex constructor
- Modify CRadixNode to use grace_time instead of last_access_time
- Implement same time update logic: extend grace time by hit_reward_seconds
  on cache hits rather than resetting to current time
- Update Python bindings to pass hit_reward_seconds to C++ index

Ensures consistent cache behavior between Python and accelerated C++
implementations.

Signed-off-by: charliecgxu <charliecgxu@tencent.com>

* quick fix of match_prefix

* [misc] Replace std::map with std::unordered_map in RadixTree

Signed-off-by: Zhaohu Xing <x.zhaohu@gmail.com>

* [feature] simplify user configuration (#37)

* add global config from env

* use config from env

* simplify port config

* remove max_req_tokens

* simple user config

* update flexkv_config

* fix benchmark

* remove unused example

* modify config doc

* fix iouring flag && allow user config override env

* update all docs

* rename layout type

* small fix

* update tracer

* adjust ssd blocks num if necessary

* fix broken tests

---------

Co-authored-by: linhu-nv <linhu@nvidia.com>

* quick fix of config (#43)

Co-authored-by: jianyingzhu <joeyzhu@nvidia.com>

* [feature] GDS refactor & gtensor support (#42)

* [refactor] gds reuse ssd handle

* refactor tpGDS

* minor naming changed

* gtensor gds support

* update doc & test

* Support construct TensorSharedHandle directly from CUDA IPC Handle

* add test file for TensorSharedHandle

* add scripts for vllm adapter

* [bugfix] fix port (#45)

* [bugfix] fix ssd allocator (#46)

* [bug fix] fix some bugs && cleanup code (#49)

* fix benchmark

* fix incorrect MatchResult

* use int64_t for offset

* fix bugs && update docs

* update config file

* fix env name

* remove useless exceptions

* quick fix

* Support using FlexKV on TensorRT-LLM (#48)

* prevent automatically initializing MPI

* disable auto-mpi-init

* Init support for TensorRT-LLM

* add scripts

* fix import and interface

* support the trtllm gpu layout and improve register api of trt_adapter

* modify log

* modify scripts

* use remote transfermanager

* some fix by hulin

* using subprocess instead of multiprocessing

* fix dead lock

* fix some bugs about gpu_register_port

* fix tensor export

* fix head_size

* fix num_kv_heads for deepseek

* fix ipc open error

* fix head_size calculation error

* fix interface

* fix get num_matched_tokens from trtllm

* fix head_size calculation error

* fix interface

* fix short len

* remove code

* add patch file

* modify scripts

* TensorRT-LLM will wait until kvmanager is ready

* [bugfix] fix token alignment issue in tensorrt-llm by rounding down to block size
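The alignment fix above rounds a token count down to a whole number of blocks; the helper name below is illustrative:

```python
# Round a token count down to a multiple of the block size, as in the
# token-alignment bugfix above. Name is hypothetical.
def align_down_to_block(num_tokens: int, block_size: int) -> int:
    return (num_tokens // block_size) * block_size
```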

* trivial

* support flexkv + cuda graph using flexkv

* modify patch

* modify scripts

* [bugfix] fix some bug

fix bug

* fix radix_tree

* modify scripts

* add debug log

* modify scripts

* fix rebase error

* fix radix tree

* fix scripts

* use new config

* rename

* fix script

* add branch for calculation of aligned_length

* add branch for remote_process

* take another way to determine branch

* fix scripts

* remove useless env and config

* remove useless commit

---------

Co-authored-by: zhuofan1123 <zhuofanl@nvidia.com>
Co-authored-by: linhu-nv <linhu@nvidia.com>
Co-authored-by: Luis-xu <hfutxjn@163.com>
Co-authored-by: annz <annz@nvidia.com>
Co-authored-by: leolingli <leolingli@tencent.com>

* Rebase and merge bugfix to dev (#51)

* [bugfix] fix for deepseek head number wrong

* [bugfix] fix bug: if the cpu match len is bigger than the ssd match len, put will cause an error

* fix radix_tree (#39)

* fix empty

---------

Co-authored-by: leolingli <leolingli@tencent.com>

* quick fix

* Fix bug found by unit test (#55)

* Add patch and doc for trtllm (#52)

* add patch

* init docs

* fin readme

* rename yml

* fix readme

* fix readme

* update docs

* fix docs

* fix docs

* fix docs

* add title

* add readme_en

* Update docs/trtllm_adaption/README_en.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: zhuofan1123 <zhuofanl@nvidia.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [bugfix] put trtllm env set before kvmanager init (#58)

* [feature] support Tp16 for vllm+flexkv (#59)

* fix bug for tp16

* update vllm adapter to support tp16

* quick fix for tp16 (#62)

---------

Signed-off-by: charliecgxu <charliecgxu@tencent.com>
Signed-off-by: lilgao <lilgao@tencent.com>
Signed-off-by: Zhaohu Xing <x.zhaohu@gmail.com>
Co-authored-by: linhu-nv <linhu@nvidia.com>
Co-authored-by: zhuofanl <zhuofanl@nvidia.com>
Co-authored-by: PY <peiyuanz@nvidia.com>
Co-authored-by: menyu <menyu@H20-GPU-05.cm.cluster>
Co-authored-by: linhu-nv <141609318+linhu-nv@users.noreply.github.com>
Co-authored-by: Zuo Gan <106919589+gz944367214@users.noreply.github.com>
Co-authored-by: zuogan <zuogan@tencent.com>
Co-authored-by: Rongwei Zhang <34190091+axxx03@users.noreply.github.com>
Co-authored-by: 869974612@qq.com <scutizhang@tencent.com>
Co-authored-by: root <root@H20-GPU-10.cm.cluster>
Co-authored-by: charliecgxu <72770768+charliecgxu@users.noreply.github.com>
Co-authored-by: Fei Liang <hanyueh@nvidia.com>
Co-authored-by: charliecgxu <charliecgxu@tencent.com>
Co-authored-by: moritzxu <moritzxu@tencent.com>
Co-authored-by: Peng Gao <peng.gao.dut@gmail.com>
Co-authored-by: lilgao <lilgao@tencent.com>
Co-authored-by: jianyingzhu <53300651@qq.com>
Co-authored-by: wenpengw-nv <wenpengw@nvidia.com>
Co-authored-by: Fei Liang <feliang@nvidia.com>
Co-authored-by: annz <annz@nvidia.com>
Co-authored-by: Zhaohu Xing <x.zhaohu@gmail.com>
Co-authored-by: Jianying <53503712+jianyingzhu@users.noreply.github.com>
Co-authored-by: jianyingzhu <joeyzhu@nvidia.com>
Co-authored-by: Luis-xu <hfutxjn@163.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
