[feature] Add GDS (GPU Direct Storage) Support #25

peaceforeverCN merged 214 commits into taco-project:dev
Conversation
…ids instead of using pin memory function inner the launch kernel for reducing the bubble
…e_fix quickfix for return type of reduce_tensor
…_buffer_manager Main process manages the ring buffer of block ids
linhu-nv left a comment
@wenpengw-nv can you please see my comments and address them? thanks
```cpp
threads_.reserve(num_gpus_);

for (int i = 0; i < num_gpus_; ++i) {
    threads_.emplace_back([&, i]() {
```
The latest "tp_transfer_thread_group.cpp" no longer creates dynamic threads for each transfer; instead, it creates a thread pool at initialization and then reuses those threads, which is much faster. Can you please refactor this the way "tp_transfer_thread_group.cpp" does? Thanks.
Already refactored to the thread-pool pattern, as in tp_transfer_thread_group.cpp. Please review.
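The pattern the reviewer asks for (spawn worker threads once at initialization and reuse them for every transfer, rather than creating threads per transfer) can be sketched in Python. This is only an illustration of the idea; the class and method names below are hypothetical, and the actual implementation lives in C++ in `tp_transfer_thread_group.cpp`.

```python
import queue
import threading

class TransferThreadPool:
    """Create num_workers threads once and reuse them for every task,
    instead of spawning a fresh thread per transfer."""

    def __init__(self, num_workers: int):
        self._tasks: queue.Queue = queue.Queue()
        self._workers = [threading.Thread(target=self._loop, daemon=True)
                         for _ in range(num_workers)]
        for w in self._workers:
            w.start()

    def _loop(self) -> None:
        # Each worker blocks on the shared queue and runs tasks forever,
        # until it receives the None shutdown sentinel.
        while True:
            task = self._tasks.get()
            if task is None:
                break
            fn, args, done = task
            fn(*args)
            done.set()  # signal completion to the submitter

    def submit(self, fn, *args) -> threading.Event:
        done = threading.Event()
        self._tasks.put((fn, args, done))
        return done

    def shutdown(self) -> None:
        for _ in self._workers:
            self._tasks.put(None)
        for w in self._workers:
            w.join()
```

A caller submits work and waits on the returned event, so repeated transfers pay no thread-creation cost after initialization.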
```diff
- if not cache_config.enable_cpu and not cache_config.use_gds:
-     raise ValueError("use_gds must be True if enable_cpu is False")
+ if not cache_config.enable_cpu and not cache_config.enable_gds:
+     raise ValueError("enable_gds must be True if enable_cpu is False")
```
Maybe we can add an assertion that `enable_ssd` and `enable_gds` cannot be used at the same time.
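Combining the check from the diff with the reviewer's suggested mutual-exclusion assertion might look like the sketch below. The `enable_cpu`, `enable_ssd`, and `enable_gds` field names come from the diff and comment; the `validate_cache_config` helper itself is hypothetical, not FlexKV's actual API.

```python
def validate_cache_config(cache_config) -> None:
    # From the diff: bypassing the CPU cache requires GDS, since GDS is
    # the only path that moves data between GPU and SSD directly.
    if not cache_config.enable_cpu and not cache_config.enable_gds:
        raise ValueError("enable_gds must be True if enable_cpu is False")
    # Reviewer's suggestion: plain SSD I/O and GDS are alternative SSD
    # paths, so enabling both at once is rejected.
    if cache_config.enable_ssd and cache_config.enable_gds:
        raise ValueError("enable_ssd and enable_gds cannot be used at the same time")
```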
setup.py
Outdated
```diff
- extra_link_args = ["-lcuda", "-lxxhash", "-lpthread", "-lrt", "-luring"]
  extra_compile_args = ["-std=c++17"]
+ extra_link_args = ["-lcuda", "-lxxhash", "-lpthread", "-lrt", "-luring", "-lcufile"]
```
We should allow flexKV to run without GDS and GDS dependencies such as cufile. So maybe we can also decide whether to add the GDS-related compile or link flags according to an environment variable such as "ENABLE_GDS": only when this env var is true do we add those flags.
Done. Added the FLEXKV_ENABLE_GDS environment variable.
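A minimal sketch of how setup.py could gate the GDS flags on that variable is shown below. The base flag lists are taken from the diff; the `FLEXKV_ENABLE_GDS` name comes from the reply above, but the `gds_build_flags` helper and the `-DFLEXKV_ENABLE_GDS` preprocessor define are assumptions for illustration, not necessarily what the PR does.

```python
import os

def gds_build_flags():
    """Return (extra_compile_args, extra_link_args) for the C++ extension.

    GDS-specific flags (the cufile link flag and an assumed preprocessor
    define) are only added when FLEXKV_ENABLE_GDS is truthy, so flexKV
    still builds on machines without the GDS/cufile dependencies."""
    compile_args = ["-std=c++17"]
    link_args = ["-lcuda", "-lxxhash", "-lpthread", "-lrt", "-luring"]
    if os.environ.get("FLEXKV_ENABLE_GDS", "0").lower() in ("1", "true", "on"):
        compile_args.append("-DFLEXKV_ENABLE_GDS")  # hypothetical define
        link_args.append("-lcufile")
    return compile_args, link_args
```

These lists would then be passed to `setuptools.Extension(..., extra_compile_args=..., extra_link_args=...)` as in the diff above.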
lgtm
- fix logic problems in client
- quick fix to client-server example
- fix some bugs to run compile
- modify blockmeta definition & impl mempool
- impl index
- add index benchmark
- faster hash
- add test for storage_engine + transfer_engine
- optim index
- fix a few bugs about performance, should have normal perf now
- add mempool benchmark
- optim mempool
- optimize index.insert
- fix mempool
- eviction implementation and optimization
- fix evictor
- flatten free ids tensor
- impl get/put pipeline
- add reset for cache engine
- global id allocator
- list to tensor
- init kvmanager
- run the pipeline
- print cpu-gpu transfer info
- refactor index
- refactor kvmanager and cache engine
- add insert_length for insert
- cleanup buffer
- Layer wise
- Added the xxhash
- add cmake
- adapt use xxhash
- fix bug
- refactor
- fix
- remove comment
- add benchmark and refactor code
- fix bug
- radix tree
- update
- update
- fix
- quick fix to ssd file allocation
- remove batch transfer
- quick fix to transfer_op initialization
- check tensor type
- support ce transfer
- modify graph creation
- remove dataclass
- fix to illegal memory access
- fix to transfer_op default values
- remove gitlab-ci yml
- submodule install automatically
- init README
- add cython
- use multiple processes for transfer workers to bypass GIL
- fix multi queue import
- quick fix
- support different layout && multiple ssd (#3)
- add debug mode
- default to debug
- default to debug
- NVTX Integration && I/O Optimization (#6)
- fix
- add nvtx
- fix from_numpy
- include remote cache engine, use ssd as fake remote file for test
- some case works now
- fix
- support parallel remote read/write
- add nvtx info
- cpu-only
- reduce polling sleep time
- remove unused link
- Support for multi-SSD read/write and round-robin layout (#8)
- support multi-ssd
- clear gpu blocks after put
- fix permission denied
- split cpp
- update
- share tensor (#9)
- test tensor sharing by zmq (#10)
- fix return
- test zmq
- Implement Server Client Mode (#11) Currently not support DP
- merge leolingli/dev-2 to dev (#13)
- add remote cache and pcfs; this version only realizes pcfs and reserves the following optimization: 1. pcfs should be a class of remote cache, not the only solution 2. pcfs arg need configurable 3. pcfs need support multi file
- add remote cache multi file; add customer args for remote cache and add pcfs args; change Chinese comment to English; move pcfs.c & .h to pcfs dir; add test_remote_kvmanager for remote test. Co-authored-by: leolingli <leolingli@tencent.com>
- add FLEXKV_ENABLE_CFS env var to control build/link flags (#14)
- some optimizations about transfer graph and task status trackers
- rebased on the latest dev branch, draft
- tp enabled, draft, committed for multi-gpu test
- tp&dp support works now on test_kvmanager
- quick fix of multi_process test, the return masks is ok now, results not accurately verified
- some small fixes, deal with empty graphs
- debug for tp and dp (#15) Co-authored-by: zuogan <zuogan@tencent.com>
- fix corner case && add pytest (#16)
- support split used radix-node
- add pytest
- add exceptions && unit-tests
- Standardize import & add .so to system path to avoid exporting manually
- refactor transferWorkers
- quick fix
- add dtype into the modelConfig so that we can support different dtypes
- use ordered_dict to implement an expiring dict
- quick fix
- quick fix to build.sh
- refactor get/put (#19)
- refactor get/put
- deal with no_space error
- local after get/put impl
- test kvmanager in pytest
- fix test
- Support (async) query interface. (#20)
- Support (async) query interface
- fix import
- replace 'try_wait_at_layer_group' with 'query_at_layer_group'
- rename. Co-authored-by: 869974612@qq.com <scutizhang@tencent.com>
- in the return mask of put op, only set tokens that are really transferred as True
- use GPUCPUTransferWorker when the tp size is only 1 & fix a bug of test script
- add pytest unit for transfer_engine
- add mypy (#22)
- mypy update
- refactor of transfer worker
- refactor and fix mypy error
- fix
- update pyproject.toml
- fix tests
- fix insert return
- rename logger. Co-authored-by: linhu-nv <linhu@nvidia.com>
- refactor tensor handle (#25)
- small fixes
- now we accept gpu_kv_layout as a parameter, and automatically infer other layouts in storage engine
- fix of kvlayout generation, manually launch transfer_engine
- quick fix
- add mla support
- refactor graph generation (#28)
- change remote cache (#30): add mla support for remote cache; use truncate function instead of write when init to change file size; add remote file_size config, can config file size in remote cache; init file in storage engine; add sequential mode (compared with round_robin) and use in remote cache; todo: pytest not support pcfs, pcfs init's thread cannot be found in pytest, use python instead; remote cache async read write; is_mla test. Co-authored-by: leolingli <leolingli@tencent.com>
- quick fix for mla + tp
- support both layer-wise and block-wise storage for cpu-blocks (#29)
- support both layer-wise and block-wise storage for cpu-blocks
- optimize multi-ssd rw
- support ssd-cpu blockwise transfer
- refactor kvlayout
- fix config
- fix bugs of mla+tp and pytest. Co-authored-by: zhuofanl <zhuofanl@nvidia.com> Co-authored-by: root <root@H20-GPU-10.cm.cluster>
- quick fix for len(token_ids) < tokens_per_blocks and len(True in mask) < token_per_block (#33)
- skip unready block in put (#34)
- io_uring support for ssd cache (#32)
- io_uring support for ssd cache: add io_uring support for ssd cache to accelerate io performance. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- tests: adding test cases for iouring function/performance testing. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- tests: adding iouring test cases to test_kvmanager.py as well. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- add tracer that can record all requests of flexKV, and the replay script (#36)
- fix to radixtree (#38)
- repair kvmanager verify logic to fit no remote cache situation (#39); change cpu_tensor_ptr to run remote cache
- quick fix for radix tree (#40)
- add header file dependency (#41) Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- fix precommit format issues (#42)
- change try_wait and only return the finished request (#43)
- use numpy instead of tensor for zmq communication (#45)
- use fadvise correctly (#44) Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- task id from client for faster get/put_async (#47)
- fix some improper variable names (#48) Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- add config (#46)
- more config for gpu transfer
- configure max blocks per file
- spread IO to as many files as possible (#49) Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- refactor pytest: add test_utils; add server-client mode
- print input parameters correctly in the error case of iouring. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- remove the restriction of pin memory for iouring: currently we do not register io buffers for iouring, so there is no restriction of pin memory. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- add unit benchmark for workers (#52)
- worker unit benchmark
- fix
- format adjusted; delete two test scripts
- remove direct flag for ssd write: the write performance of SSD is much worse than the read performance, so remove the O_DIRECT flag when doing write operations. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- refactor benchmark_cache_engine and fix some issues (#55)
- update
- fix tests
- rename utils.py
- fix bug in benchmark
- refactor random request generation
- fix rebase error
- avoid opening ssd files per io request. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- fix bug, ssd_layer_stride_in_bytes compute error
- add e2e benchmark for kvmanager (#58)
- fix default params
- add server_scheduler to reduce process communication overhead (#59)
- Unify the interface of the flexkv server (#60)
- select direct_io fds only in read && direct mode. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- integrate expiring_dict
- use Pipe instead of Queue for comm in transferEngine and worker
- limit the id range
- fallback to preadv/pwritev when iouring inflight requests are over the limit (#64) Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- reduce bubble between op launch (#65)
- some small fixes (#66)
- remove worker_init_timeout_minutes (#67)
- quick fix that the is_mla is not given to tp_client
- add a message when error (#71): 1. fix bug, need return torch.empty(0, dtype=bool), or returning float will cause a vllm int-add-float problem 2. delete info in client for performance 3. add is_ready for client to determine whether flexkv is ready
- radix tree c++ impl (#70)
- radix tree implementation in c++. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- support new radix-tree in cache engine. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- sync kernel launch
- kvmanager refactor (#73)
- add KVCacheEngineClient APIs
- basic implementation for KVCacheEngineClient
- initial transfer manager
- init transfer handle
- init kv engine
- refactor kvmanager
- update kvmanager
- some refactor
- kv response
- add benchmark
- serialize graph
- fix bugs
- ready check
- update
- rename
- rename benchmark
- use numpy instead of tensor
- small fix
- remove transfer descriptor
- rename to kvmanager
- update api
- add gpu-kvcache-verifier, draft
- update
- create a new tp-worker process and create gpu blocks for verification
- rename
- the test_kvmanager works now
- fix virtual op initialize
- fix verifier bug when tp > 1 and mla enabled
- fix
- remove task id && some fix
- only create one h2d op
- pass slotmapping for launch
- quick fix. Co-authored-by: linhu-nv <linhu@nvidia.com> Co-authored-by: Fei Liang <hanyueh@nvidia.com>
- feat: add support release wheel (#77)
- feat: add support release wheel. Signed-off-by: lilgao <lilgao@tencent.com>
- fix copilot review for ci. Signed-off-by: lilgao <lilgao@tencent.com> Co-authored-by: lilgao <lilgao@tencent.com>
- add evict_ratio in cache config, default is 0; evict number is max(int(mempool.num_total_blocks*evict_ratio), former evict number)
- update unit tests for new version (#79)
- update test cache engine
- update test cache engine accel
- remove some tests
- rename functions
- enable profile in release build. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- update benchmark worker (#82)
- update benchmark worker
- status map
- enable nvtx
- fix default config
- clear cpu to test ssd cache
- ci: trigger on main and dev. Signed-off-by: lilgao <lilgao@tencent.com>
- fix broken cpp radix tree support for cache engine (#84)
- adjust index accel to new cache engine data struct. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- fix broken tests for cache engine. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- fix direct io
- Using ring buffer in transfer engine to manage the src and dst block ids instead of using pin memory function inner the launch kernel for reducing the bubble
- quickfix for return type of reduce_tensor
- fix bug
- refine ring_buffer and apply it to all workers
- rename PinnedMemoryRing to SharedMemoryRing
- fix status bug
- allow to exceed the max_block_num
- refactor: use hash to allocate buffer && no wait for free slot
- add gds
- fix batch sync
- add gds transfer worker support
- op-level callback
- fix bugs
- support dp > 1 while integrated with vllm
- avoid redundant d2h data transfer for mla in tp
- add gds worker & test
- gdsput changed to original ssdtransfer
- fix callback bug
- add gds docs
- remove redundant code
- Tp16 support (#26)
- initial tp16 support
- avoid global mp context setting
- model_config for transfer
- configured by master node
- build flag & assert
- refactor gds transfer thread
- [feature] Different gtensor layouts (#27)
- support both vllm gpu tensor layouts and trtllm gpu tensor layouts
- fix some small bugs
- use template specialization to support different gpu tensor layouts
- skip gds tests when not intended
- add test for sglang. Co-authored-by: Fei Liang <feliang@nvidia.com> Co-authored-by: root <root@H20-GPU-10.cm.cluster> Co-authored-by: annz <annz@nvidia.com>
- quick fix to benchmark_worker (#31) Co-authored-by: annz <annz@nvidia.com>
- [feature] optimize SSD I/O (#33)
- blockfirst ssd io
- set io_uring flag=1
- batch sync for iouring
- bench bidirection transfer
- swap loop to improve multi-SSD bandwidth
- prefer read
- [Feature] Implement grace time with hit reward for cache node: introduces a grace time mechanism with hit reward seconds for cache nodes in the radix tree index. Key changes: replace last_access_time with grace_time in RadixNode; add hit_reward_seconds parameter to CacheEngine and RadixTreeIndex; update node time calculation so that on cache hit, grace time is extended by hit_reward_seconds instead of resetting to current time; add hit_reward_seconds configuration option in CacheConfig. The new mechanism helps prioritize frequently accessed nodes in cache eviction by rewarding hits with extended grace periods. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- [Feature] Extend grace time with hit reward to accelerated index: extends the grace time with hit reward mechanism to the accelerated C++ radix tree implementation. Key changes: add hit_reward_seconds parameter to CRadixTreeIndex constructor; modify CRadixNode to use grace_time instead of last_access_time; implement the same time update logic (extend grace time by hit_reward_seconds on cache hits rather than resetting to current time); update Python bindings to pass hit_reward_seconds to the C++ index. Ensures consistent cache behavior between the Python and accelerated C++ implementations. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- quick fix of match_prefix
- [misc] Replace std::map with std::unordered_map in RadixTree. Signed-off-by: Zhaohu Xing <x.zhaohu@gmail.com>
- [feature] simplify user configuration (#37)
- add global config from env
- use config from env
- simplify port config
- remove max_req_tokens
- simple user config
- update flexkv_config
- fix benchmark
- remove unused example
- modify config doc
- fix iouring flag && allow user config override env
- update all docs
- rename layout type
- small fix
- update tracer
- adjust ssd blocks num if necessary
- fix broken tests. Co-authored-by: linhu-nv <linhu@nvidia.com>
- quick fix of config (#43) Co-authored-by: jianyingzhu <joeyzhu@nvidia.com>
- [feature] GDS refactor & gtensor support (#42)
- [refactor] gds reuse ssd handle
- refactor tpGDS
- minor naming changed
- gtensor gds support
- update doc & test
- Support construct TensorSharedHandle directly from CUDA IPC Handle
- add test file for TensorSharedHandle
- add scripts for vllm adapter
- [bugfix] fix port (#45)
- [bugfix] fix ssd allocator (#46)
- [bug fix] fix some bugs && cleanup code (#49)
- fix benchmark
- fix incorrect MatchResult
- use int64_t for offset
- fix bugs && update docs
- update config file
- fix env name
- remove useless exceptions
- quick fix
- Support using FlexKV on TensorRT-LLM (#48)
- prevent automatically initializing MPI
- disable auto-mpi-init
- Init support for TensorRT-LLM
- add scripts
- fix import and interface
- support the trtllm gpu layout and improve register api of trt_adapter
- modify log
- modify scripts
- use remote transfermanager
- some fix by hulin
- using subprocess instead of multiprocessing
- fix dead lock
- fix some bugs about gpu_register_port
- fix tensor export
- fix head_size
- fix num_kv_heads for deepseek
- fix ipc open error
- fix head_size calculation error
- fix interface
- fix get num_matched_tokens from trtllm
- fix head_size calculation error
- fix interface
- fix short len
- remove code
- add patch file
- modify scripts
- tensorRT LLM will wait until kvmanager is ready
- [bugfix] fix token alignment issue in tensorrt-llm by rounding down to block size
- trivial
- support flexkv + cuda graph using flexkv
- modify patch
- modify scripts
- [bugfix] fix some bug
- fix radix_tree
- modify scripts
- add debug log
- modify scripts
- fix rebase error
- fix radix tree
- fix scripts
- use new config
- rename
- fix script
- add branch for calculation of aligned_length
- add branch for remote_process
- take another way to determine branch
- fix scripts
- remove useless env and config
- remove useless commit. Co-authored-by: zhuofan1123 <zhuofanl@nvidia.com> Co-authored-by: linhu-nv <linhu@nvidia.com> Co-authored-by: Luis-xu <hfutxjn@163.com> Co-authored-by: annz <annz@nvidia.com> Co-authored-by: leolingli <leolingli@tencent.com>
- Rebase and merge bugfix to dev (#51)
- [bugfix] fix for deepseek head number wrong
- [bugfix] fix bug, if cpu match len is bigger than ssd, a put will cause an error
- fix radix_tree (#39)
- fix empty. Co-authored-by: leolingli <leolingli@tencent.com>
- quick fix
- Fix bug found by unit test (#55)
- Add patch and doc for trtllm (#52)
- add patch
- init docs
- fin readme
- rename yml
- fix readme
- fix readme
- update docs
- fix docs
- fix docs
- fix docs
- add title
- add readme_en
- Update docs/trtllm_adaption/README_en.md. Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: zhuofan1123 <zhuofanl@nvidia.com>
- [bugfix] put trtllm env set before kvmanager init (#58)
- [feature] support Tp16 for vllm+flexkv (#59)
- fix bug for tp16
- update vllm adapter to support tp16
- quick fix for tp16 (#62)

Signed-off-by: charliecgxu <charliecgxu@tencent.com>
Signed-off-by: lilgao <lilgao@tencent.com>
Signed-off-by: Zhaohu Xing <x.zhaohu@gmail.com>
Co-authored-by: linhu-nv <linhu@nvidia.com>
Co-authored-by: zhuofanl <zhuofanl@nvidia.com>
Co-authored-by: PY <peiyuanz@nvidia.com>
Co-authored-by: menyu <menyu@H20-GPU-05.cm.cluster>
Co-authored-by: linhu-nv <141609318+linhu-nv@users.noreply.github.com>
Co-authored-by: Zuo Gan <106919589+gz944367214@users.noreply.github.com>
Co-authored-by: zuogan <zuogan@tencent.com>
Co-authored-by: Rongwei Zhang <34190091+axxx03@users.noreply.github.com>
Co-authored-by: 869974612@qq.com <scutizhang@tencent.com>
Co-authored-by: root <root@H20-GPU-10.cm.cluster>
Co-authored-by: charliecgxu <72770768+charliecgxu@users.noreply.github.com>
Co-authored-by: Fei Liang <hanyueh@nvidia.com>
Co-authored-by: charliecgxu <charliecgxu@tencent.com>
Co-authored-by: moritzxu <moritzxu@tencent.com>
Co-authored-by: Peng Gao <peng.gao.dut@gmail.com>
Co-authored-by: lilgao <lilgao@tencent.com>
Co-authored-by: jianyingzhu <53300651@qq.com>
Co-authored-by: wenpengw-nv <wenpengw@nvidia.com>
Co-authored-by: Fei Liang <feliang@nvidia.com>
Co-authored-by: annz <annz@nvidia.com>
Co-authored-by: Zhaohu Xing <x.zhaohu@gmail.com>
Co-authored-by: Jianying <53503712+jianyingzhu@users.noreply.github.com>
Co-authored-by: jianyingzhu <joeyzhu@nvidia.com>
Co-authored-by: Luis-xu <hfutxjn@163.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This PR adds GPU Direct Storage (GDS) support to FlexKV, enabling direct data transfer between GPU and SSD.
flexkv/cache/cache_engine.py) and the addition of GDS Transfer Workers.