[feature] Add GDS (GPU Direct Storage) Support #25

peaceforeverCN merged 214 commits into taco-project:dev
Conversation
…ids instead of using pin memory function inner the launch kernel for reducing the bubble
…e_fix quickfix for return type of reduce_tensor
…_buffer_manager Main process manages the ring buffer of block ids
linhu-nv left a comment
@wenpengw-nv can you please see my comments and address them? thanks
```cpp
threads_.reserve(num_gpus_);

for (int i = 0; i < num_gpus_; ++i) {
    threads_.emplace_back([&, i]() {
```
The latest "tp_transfer_thread_group.cpp" no longer creates dynamic threads for each transfer; instead, it creates a thread pool at initialization and then reuses those threads, which is much faster. Can you please refactor this the way "tp_transfer_thread_group.cpp" does? Thanks.
Already refactored to the thread-pool pattern, as in tp_transfer_thread_group.cpp. Please review.
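The pattern the reviewer asks for (spawn worker threads once at initialization and reuse them for every transfer, rather than creating threads per transfer) can be sketched in Python. This is only an illustration of the idea; the class and method names below are hypothetical, and the actual implementation lives in C++ in `tp_transfer_thread_group.cpp`.

```python
import queue
import threading

class TransferThreadPool:
    """Create num_workers threads once and reuse them for every task,
    instead of spawning a fresh thread per transfer."""

    def __init__(self, num_workers: int):
        self._tasks: queue.Queue = queue.Queue()
        self._workers = [threading.Thread(target=self._loop, daemon=True)
                         for _ in range(num_workers)]
        for w in self._workers:
            w.start()

    def _loop(self) -> None:
        # Each worker blocks on the shared queue and runs tasks forever,
        # until it receives the None shutdown sentinel.
        while True:
            task = self._tasks.get()
            if task is None:
                break
            fn, args, done = task
            fn(*args)
            done.set()  # signal completion to the submitter

    def submit(self, fn, *args) -> threading.Event:
        done = threading.Event()
        self._tasks.put((fn, args, done))
        return done

    def shutdown(self) -> None:
        for _ in self._workers:
            self._tasks.put(None)
        for w in self._workers:
            w.join()
```

A caller submits work and waits on the returned event, so repeated transfers pay no thread-creation cost after initialization.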
```diff
- if not cache_config.enable_cpu and not cache_config.use_gds:
-     raise ValueError("use_gds must be True if enable_cpu is False")
+ if not cache_config.enable_cpu and not cache_config.enable_gds:
+     raise ValueError("enable_gds must be True if enable_cpu is False")
```
Maybe we can add an assertion that `enable_ssd` and `enable_gds` cannot be used at the same time.
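Combining the check from the diff with the reviewer's suggested mutual-exclusion assertion might look like the sketch below. The `enable_cpu`, `enable_ssd`, and `enable_gds` field names come from the diff and comment; the `validate_cache_config` helper itself is hypothetical, not FlexKV's actual API.

```python
def validate_cache_config(cache_config) -> None:
    # From the diff: bypassing the CPU cache requires GDS, since GDS is
    # the only path that moves data between GPU and SSD directly.
    if not cache_config.enable_cpu and not cache_config.enable_gds:
        raise ValueError("enable_gds must be True if enable_cpu is False")
    # Reviewer's suggestion: plain SSD I/O and GDS are alternative SSD
    # paths, so enabling both at once is rejected.
    if cache_config.enable_ssd and cache_config.enable_gds:
        raise ValueError("enable_ssd and enable_gds cannot be used at the same time")
```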
setup.py
Outdated
```diff
- extra_link_args = ["-lcuda", "-lxxhash", "-lpthread", "-lrt", "-luring"]
  extra_compile_args = ["-std=c++17"]
+ extra_link_args = ["-lcuda", "-lxxhash", "-lpthread", "-lrt", "-luring", "-lcufile"]
```
We should allow flexKV to run without GDS and GDS dependencies such as cufile. So maybe we can also decide whether to add the GDS-related compile or link flags according to an environment variable such as "ENABLE_GDS": only when this env var is true do we add those flags.
Done. Added the FLEXKV_ENABLE_GDS environment variable.
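A minimal sketch of how setup.py could gate the GDS flags on that variable is shown below. The base flag lists are taken from the diff; the `FLEXKV_ENABLE_GDS` name comes from the reply above, but the `gds_build_flags` helper and the `-DFLEXKV_ENABLE_GDS` preprocessor define are assumptions for illustration, not necessarily what the PR does.

```python
import os

def gds_build_flags():
    """Return (extra_compile_args, extra_link_args) for the C++ extension.

    GDS-specific flags (the cufile link flag and an assumed preprocessor
    define) are only added when FLEXKV_ENABLE_GDS is truthy, so flexKV
    still builds on machines without the GDS/cufile dependencies."""
    compile_args = ["-std=c++17"]
    link_args = ["-lcuda", "-lxxhash", "-lpthread", "-lrt", "-luring"]
    if os.environ.get("FLEXKV_ENABLE_GDS", "0").lower() in ("1", "true", "on"):
        compile_args.append("-DFLEXKV_ENABLE_GDS")  # hypothetical define
        link_args.append("-lcufile")
    return compile_args, link_args
```

These lists would then be passed to `setuptools.Extension(..., extra_compile_args=..., extra_link_args=...)` as in the diff above.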
lgtm
- fix logic problems in client
- quick fix to client-server example
- fix some bugs to run compile
- modify blockmeta definition & impl mempool
- impl index
- add index benchmark
- faster hash
- add test for storage_engine + transfer_engine
- optim index
- fix a few bugs about performance, should have normal perf now
- add mempool benchmark
- optim mempool
- optimize index.insert
- fix mempool
- eviction implementation and optimization
- fix evictor
- flatten free ids tensor
- impl get/put pipeline
- add reset for cache engine
- global id allocator
- list to tensor
- init kvmanager
- run the pipeline
- print cpu-gpu transfer info
- refactor index
- refactor kvmanager and cache engine
- add insert_length for insert
- cleanup buffer
- Layer wise
- Added the xxhash
- add cmake
- adapt use xxhash
- fix bug
- refactor
- fix
- remove comment
- add benchmark and refactor code
- fix bug
- radix tree
- update
- update
- fix
- quick fix to ssd file allocation
- remove batch transfer
- quick fix to transfer_op initialization
- check tensor type
- support ce transfer
- modify graph creation
- remove dataclass
- fix to illegal memory access
- fix to transfer_op default values
- remove gitlab-ci yml
- submodule install automatically
- init README
- add cython
- use multiple processes for transfer workers to bypass GIL
- fix multi queue import
- quick fix
- support different layout && multiple ssd (#3)
- add debug mode
- default to debug
- default to debug
- NVTX Integration && I/O Optimization (#6)
- fix
- add nvtx
- fix from_numpy
- include remote cache engine, use ssd as fake remote file for test
- some case works now
- fix
- support parallel remote read/write
- add nvtx info
- cpu-only
- reduce polling sleep time
- remove unused link
- Support for multi-SSD read/write and round-robin layout (#8)
- support multi-ssd
- clear gpu blocks after put
- fix permission denied
- split cpp
- update
- share tensor (#9)
- test tensor sharing by zmq (#10)
- fix return
- test zmq
- Implement Server Client Mode (#11) Currently not support DP
- merge leolingli/dev-2 to dev (#13)
- add remote cache and pcfs; this version only realizes pcfs and reserves the following optimization: 1. pcfs should be a class of remote cache, not the only solution 2. pcfs arg need configurable 3. pcfs need support multi file
- add remote cache multi file; add customer args for remote cache and add pcfs args; change Chinese comment to English; move pcfs.c & .h to pcfs dir; add test_remote_kvmanager for remote test. Co-authored-by: leolingli <leolingli@tencent.com>
- add FLEXKV_ENABLE_CFS env var to control build/link flags (#14)
- some optimizations about transfer graph and task status trackers
- rebased on the latest dev branch, draft
- tp enabled, draft, committed for multi-gpu test
- tp&dp support works now on test_kvmanager
- quick fix of multi_process test, the return masks is ok now, results not accurately verified
- some small fixes, deal with empty graphs
- debug for tp and dp (#15) Co-authored-by: zuogan <zuogan@tencent.com>
- fix corner case && add pytest (#16)
- support split used radix-node
- add pytest
- add exceptions && unit-tests
- Standardize import & add .so to system path to avoid exporting manually
- refactor transferWorkers
- quick fix
- add dtype into the modelConfig so that we can support different dtypes
- use ordered_dict to implement an expiring dict
- quick fix
- quick fix to build.sh
- refactor get/put (#19)
- refactor get/put
- deal with no_space error
- local after get/put impl
- test kvmanager in pytest
- fix test
- Support (async) query interface. (#20)
- Support (async) query interface
- fix import
- replace 'try_wait_at_layer_group' with 'query_at_layer_group'
- rename. Co-authored-by: 869974612@qq.com <scutizhang@tencent.com>
- in the return mask of put op, only set tokens that are really transferred as True
- use GPUCPUTransferWorker when the tp size is only 1 & fix a bug of test script
- add pytest unit for transfer_engine
- add mypy (#22)
- mypy update
- refactor of transfer worker
- refactor and fix mypy error
- fix
- update pyproject.toml
- fix tests
- fix insert return
- rename logger. Co-authored-by: linhu-nv <linhu@nvidia.com>
- refactor tensor handle (#25)
- small fixes
- now we accept gpu_kv_layout as a parameter, and automatically infer other layouts in storage engine
- fix of kvlayout generation, manually launch transfer_engine
- quick fix
- add mla support
- refactor graph generation (#28)
- change remote cache (#30): add mla support for remote cache; use truncate function instead of write when init to change file size; add remote file_size config, can config file size in remote cache; init file in storage engine; add sequential mode (compared with round_robin) and use in remote cache; todo: pytest not support pcfs, pcfs init's thread cannot be found in pytest, use python instead; remote cache async read write; is_mla test. Co-authored-by: leolingli <leolingli@tencent.com>
- quick fix for mla + tp
- support both layer-wise and block-wise storage for cpu-blocks (#29)
- support both layer-wise and block-wise storage for cpu-blocks
- optimize multi-ssd rw
- support ssd-cpu blockwise transfer
- refactor kvlayout
- fix config
- fix bugs of mla+tp and pytest. Co-authored-by: zhuofanl <zhuofanl@nvidia.com> Co-authored-by: root <root@H20-GPU-10.cm.cluster>
- quick fix for len(token_ids) < tokens_per_blocks and len(True in mask) < token_per_block (#33)
- skip unready block in put (#34)
- io_uring support for ssd cache (#32)
- io_uring support for ssd cache: add io_uring support for ssd cache to accelerate io performance. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- tests: adding test cases for iouring function/performance testing. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- tests: adding iouring test cases to test_kvmanager.py as well. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- add tracer that can record all requests of flexKV, and the replay script (#36)
- fix to radixtree (#38)
- repair kvmanager verify logic to fit no remote cache situation (#39); change cpu_tensor_ptr to run remote cache
- quick fix for radix tree (#40)
- add header file dependency (#41) Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- fix precommit format issues (#42)
- change try_wait and only return the finished request (#43)
- use numpy instead of tensor for zmq communication (#45)
- use fadvise correctly (#44) Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- task id from client for faster get/put_async (#47)
- fix some improper variable names (#48) Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- add config (#46)
- more config for gpu transfer
- configure max blocks per file
- spread IO to as many files as possible (#49) Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- refactor pytest: add test_utils; add server-client mode
- print input parameters correctly in the error case of iouring. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- remove the restriction of pin memory for iouring: currently we do not register io buffers for iouring, so there is no restriction of pin memory. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- add unit benchmark for workers (#52)
- worker unit benchmark
- fix
- format adjusted; delete two test scripts
- remove direct flag for ssd write: the write performance of SSD is much worse than the read performance, so remove the O_DIRECT flag when doing write operations. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- refactor benchmark_cache_engine and fix some issues (#55)
- update
- fix tests
- rename utils.py
- fix bug in benchmark
- refactor random request generation
- fix rebase error
- avoid opening ssd files per io request. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- fix bug, ssd_layer_stride_in_bytes compute error
- add e2e benchmark for kvmanager (#58)
- fix default params
- add server_scheduler to reduce process communication overhead (#59)
- Unify the interface of the flexkv server (#60)
- select direct_io fds only in read && direct mode. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- integrate expiring_dict
- use Pipe instead of Queue for comm in transferEngine and worker
- limit the id range
- fallback to preadv/pwritev when iouring inflight requests are over the limit (#64) Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- reduce bubble between op launch (#65)
- some small fixes (#66)
- remove worker_init_timeout_minutes (#67)
- quick fix that the is_mla is not given to tp_client
- add a message when error (#71): 1. fix bug, need return torch.empty(0, dtype=bool), or returning float will cause a vllm int-add-float problem 2. delete info in client for performance 3. add is_ready for client to determine whether flexkv is ready
- radix tree c++ impl (#70)
- radix tree implementation in c++. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- support new radix-tree in cache engine. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- sync kernel launch
- kvmanager refactor (#73)
- add KVCacheEngineClient APIs
- basic implementation for KVCacheEngineClient
- initial transfer manager
- init transfer handle
- init kv engine
- refactor kvmanager
- update kvmanager
- some refactor
- kv response
- add benchmark
- serialize graph
- fix bugs
- ready check
- update
- rename
- rename benchmark
- use numpy instead of tensor
- small fix
- remove transfer descriptor
- rename to kvmanager
- update api
- add gpu-kvcache-verifier, draft
- update
- create a new tp-worker process and create gpu blocks for verification
- rename
- the test_kvmanager works now
- fix virtual op initialize
- fix verifier bug when tp > 1 and mla enabled
- fix
- remove task id && some fix
- only create one h2d op
- pass slotmapping for launch
- quick fix. Co-authored-by: linhu-nv <linhu@nvidia.com> Co-authored-by: Fei Liang <hanyueh@nvidia.com>
- feat: add support release wheel (#77)
- feat: add support release wheel. Signed-off-by: lilgao <lilgao@tencent.com>
- fix copilot review for ci. Signed-off-by: lilgao <lilgao@tencent.com> Co-authored-by: lilgao <lilgao@tencent.com>
- add evict_ratio in cache config, default is 0; evict number is max(int(mempool.num_total_blocks*evict_ratio), former evict number)
- update unit tests for new version (#79)
- update test cache engine
- update test cache engine accel
- remove some tests
- rename functions
- enable profile in release build. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- update benchmark worker (#82)
- update benchmark worker
- status map
- enable nvtx
- fix default config
- clear cpu to test ssd cache
- ci: trigger on main and dev. Signed-off-by: lilgao <lilgao@tencent.com>
- fix broken cpp radix tree support for cache engine (#84)
- adjust index accel to new cache engine data struct. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- fix broken tests for cache engine. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- fix direct io
- Using ring buffer in transfer engine to manage the src and dst block ids instead of using pin memory function inner the launch kernel for reducing the bubble
- quickfix for return type of reduce_tensor
- fix bug
- refine ring_buffer and apply it to all workers
- rename PinnedMemoryRing to SharedMemoryRing
- fix status bug
- allow to exceed the max_block_num
- refactor: use hash to allocate buffer && no wait for free slot
- add gds
- fix batch sync
- add gds transfer worker support
- op-level callback
- fix bugs
- support dp > 1 while integrated with vllm
- avoid redundant d2h data transfer for mla in tp
- add gds worker & test
- gdsput changed to original ssdtransfer
- fix callback bug
- add gds docs
- remove redundant code
- Tp16 support (#26)
- initial tp16 support
- avoid global mp context setting
- model_config for transfer
- configured by master node
- build flag & assert
- refactor gds transfer thread
- [feature] Different gtensor layouts (#27)
- support both vllm gpu tensor layouts and trtllm gpu tensor layouts
- fix some small bugs
- use template specialization to support different gpu tensor layouts
- skip gds tests when not intended
- add test for sglang. Co-authored-by: Fei Liang <feliang@nvidia.com> Co-authored-by: root <root@H20-GPU-10.cm.cluster> Co-authored-by: annz <annz@nvidia.com>
- quick fix to benchmark_worker (#31) Co-authored-by: annz <annz@nvidia.com>
- [feature] optimize SSD I/O (#33)
- blockfirst ssd io
- set io_uring flag=1
- batch sync for iouring
- bench bidirection transfer
- swap loop to improve multi-SSD bandwidth
- prefer read
- [Feature] Implement grace time with hit reward for cache node: introduces a grace time mechanism with hit reward seconds for cache nodes in the radix tree index. Key changes: replace last_access_time with grace_time in RadixNode; add hit_reward_seconds parameter to CacheEngine and RadixTreeIndex; update node time calculation so that on cache hit, grace time is extended by hit_reward_seconds instead of resetting to current time; add hit_reward_seconds configuration option in CacheConfig. The new mechanism helps prioritize frequently accessed nodes in cache eviction by rewarding hits with extended grace periods. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- [Feature] Extend grace time with hit reward to accelerated index: extends the grace time with hit reward mechanism to the accelerated C++ radix tree implementation. Key changes: add hit_reward_seconds parameter to CRadixTreeIndex constructor; modify CRadixNode to use grace_time instead of last_access_time; implement the same time update logic (extend grace time by hit_reward_seconds on cache hits rather than resetting to current time); update Python bindings to pass hit_reward_seconds to the C++ index. Ensures consistent cache behavior between the Python and accelerated C++ implementations. Signed-off-by: charliecgxu <charliecgxu@tencent.com>
- quick fix of match_prefix
- [misc] Replace std::map with std::unordered_map in RadixTree. Signed-off-by: Zhaohu Xing <x.zhaohu@gmail.com>
- [feature] simplify user configuration (#37)
- add global config from env
- use config from env
- simplify port config
- remove max_req_tokens
- simple user config
- update flexkv_config
- fix benchmark
- remove unused example
- modify config doc
- fix iouring flag && allow user config override env
- update all docs
- rename layout type
- small fix
- update tracer
- adjust ssd blocks num if necessary
- fix broken tests. Co-authored-by: linhu-nv <linhu@nvidia.com>
- quick fix of config (#43) Co-authored-by: jianyingzhu <joeyzhu@nvidia.com>
- [feature] GDS refactor & gtensor support (#42)
- [refactor] gds reuse ssd handle
- refactor tpGDS
- minor naming changed
- gtensor gds support
- update doc & test
- Support construct TensorSharedHandle directly from CUDA IPC Handle
- add test file for TensorSharedHandle
- add scripts for vllm adapter
- [bugfix] fix port (#45)
- [bugfix] fix ssd allocator (#46)
- [bug fix] fix some bugs && cleanup code (#49)
- fix benchmark
- fix incorrect MatchResult
- use int64_t for offset
- fix bugs && update docs
- update config file
- fix env name
- remove useless exceptions
- quick fix
- Support using FlexKV on TensorRT-LLM (#48)
- prevent automatically initializing MPI
- disable auto-mpi-init
- Init support for TensorRT-LLM
- add scripts
- fix import and interface
- support the trtllm gpu layout and improve register api of trt_adapter
- modify log
- modify scripts
- use remote transfermanager
- some fix by hulin
- using subprocess instead of multiprocessing
- fix dead lock
- fix some bugs about gpu_register_port
- fix tensor export
- fix head_size
- fix num_kv_heads for deepseek
- fix ipc open error
- fix head_size calculation error
- fix interface
- fix get num_matched_tokens from trtllm
- fix head_size calculation error
- fix interface
- fix short len
- remove code
- add patch file
- modify scripts
- tensorRT LLM will wait until kvmanager is ready
- [bugfix] fix token alignment issue in tensorrt-llm by rounding down to block size
- trivial
- support flexkv + cuda graph using flexkv
- modify patch
- modify scripts
- [bugfix] fix some bug
- fix radix_tree
- modify scripts
- add debug log
- modify scripts
- fix rebase error
- fix radix tree
- fix scripts
- use new config
- rename
- fix script
- add branch for calculation of aligned_length
- add branch for remote_process
- take another way to determine branch
- fix scripts
- remove useless env and config
- remove useless commit. Co-authored-by: zhuofan1123 <zhuofanl@nvidia.com> Co-authored-by: linhu-nv <linhu@nvidia.com> Co-authored-by: Luis-xu <hfutxjn@163.com> Co-authored-by: annz <annz@nvidia.com> Co-authored-by: leolingli <leolingli@tencent.com>
- Rebase and merge bugfix to dev (#51)
- [bugfix] fix for deepseek head number wrong
- [bugfix] fix bug, if cpu match len is bigger than ssd, a put will cause an error
- fix radix_tree (#39)
- fix empty. Co-authored-by: leolingli <leolingli@tencent.com>
- quick fix
- Fix bug found by unit test (#55)
- Add patch and doc for trtllm (#52)
- add patch
- init docs
- fin readme
- rename yml
- fix readme
- fix readme
- update docs
- fix docs
- fix docs
- fix docs
- add title
- add readme_en
- Update docs/trtllm_adaption/README_en.md. Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: zhuofan1123 <zhuofanl@nvidia.com>
- [bugfix] put trtllm env set before kvmanager init (#58)
- [feature] support Tp16 for vllm+flexkv (#59)
- fix bug for tp16
- update vllm adapter to support tp16
- quick fix for tp16 (#62)

Signed-off-by: charliecgxu <charliecgxu@tencent.com>
Signed-off-by: lilgao <lilgao@tencent.com>
Signed-off-by: Zhaohu Xing <x.zhaohu@gmail.com>
Co-authored-by: linhu-nv <linhu@nvidia.com>
Co-authored-by: zhuofanl <zhuofanl@nvidia.com>
Co-authored-by: PY <peiyuanz@nvidia.com>
Co-authored-by: menyu <menyu@H20-GPU-05.cm.cluster>
Co-authored-by: linhu-nv <141609318+linhu-nv@users.noreply.github.com>
Co-authored-by: Zuo Gan <106919589+gz944367214@users.noreply.github.com>
Co-authored-by: zuogan <zuogan@tencent.com>
Co-authored-by: Rongwei Zhang <34190091+axxx03@users.noreply.github.com>
Co-authored-by: 869974612@qq.com <scutizhang@tencent.com>
Co-authored-by: root <root@H20-GPU-10.cm.cluster>
Co-authored-by: charliecgxu <72770768+charliecgxu@users.noreply.github.com>
Co-authored-by: Fei Liang <hanyueh@nvidia.com>
Co-authored-by: charliecgxu <charliecgxu@tencent.com>
Co-authored-by: moritzxu <moritzxu@tencent.com>
Co-authored-by: Peng Gao <peng.gao.dut@gmail.com>
Co-authored-by: lilgao <lilgao@tencent.com>
Co-authored-by: jianyingzhu <53300651@qq.com>
Co-authored-by: wenpengw-nv <wenpengw@nvidia.com>
Co-authored-by: Fei Liang <feliang@nvidia.com>
Co-authored-by: annz <annz@nvidia.com>
Co-authored-by: Zhaohu Xing <x.zhaohu@gmail.com>
Co-authored-by: Jianying <53503712+jianyingzhu@users.noreply.github.com>
Co-authored-by: jianyingzhu <joeyzhu@nvidia.com>
Co-authored-by: Luis-xu <hfutxjn@163.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This PR adds GPU Direct Storage (GDS) support to FlexKV, enabling direct data transfer between GPU and SSD.
flexkv/cache/cache_engine.py) and the addition of GDS Transfer Workers.