[WIP][Core][RDT] Reuse nixl agent #60602

Chong-Li · 2026-01-30T06:33:43Z

Description

Reuse the NIXL remote agent across transfers to avoid a cold start on every RDT transfer.

Related issues

Additional information

gemini-code-assist

Code Review

This pull request introduces a caching mechanism for NIXL remote agents to optimize performance by reducing cold start times for RDT operations. Key changes include adding new metadata fields to NixlTransportMetadata, implementing an LRU cache for remote agents, and modifying the extract_tensor_transport_metadata, recv_multiple_tensors, and garbage_collect methods to integrate with this cache. A new constant NIXL_REMOTE_AGENT_CACHE_MAXSIZE is also added for configuring the cache size.

gemini-code-assist · 2026-01-30T06:35:26Z

python/ray/experimental/gpu_object_manager/nixl_tensor_transport.py

+                try:
+                    nixl_agent.remove_remote_agent(evicted_agent_name)
+                except Exception as e:
+                    print(f"Warning: Failed to remove remote agent: {e}")


The try-except Exception block is very broad. While it prevents crashes, it might hide specific issues during remove_remote_agent. It would be more robust to catch more specific exceptions if known, or at least log the full traceback for better debugging of potential problems with agent removal.
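One way to act on this suggestion is sketched below. The thread does not say which exception types the NIXL bindings raise, so the sketch keeps the broad catch and records the full traceback through a module-level logger. FakeAgent and evict_remote_agent are hypothetical names used only for illustration; they are not part of the PR.

```python
import logging

logger = logging.getLogger(__name__)


class FakeAgent:
    """Hypothetical stand-in for the NIXL agent; removal always fails."""

    def remove_remote_agent(self, name):
        raise RuntimeError(f"agent {name!r} not found")


def evict_remote_agent(agent, evicted_agent_name):
    """Try to remove an evicted agent; log the full traceback on failure."""
    try:
        agent.remove_remote_agent(evicted_agent_name)
        return True
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback,
        # so the failure stays visible without crashing the transfer path.
        logger.exception("Failed to remove remote agent %s", evicted_agent_name)
        return False
```

If the bindings turn out to expose a dedicated error type, the except clause could be narrowed to it without changing the rest of the function.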

cursor

Cursor Bugbot has reviewed your changes and found 4 potential issues.

cursor · 2026-01-30T06:41:26Z

python/ray/experimental/gpu_object_manager/nixl_tensor_transport.py

+                    )
+                else:
+                    nixl_agent.remove_remote_agent(remote_name)
+                    nixl_agent.add_remote_agent(remote_nixl_agent_meta)


Return value ignored when re-adding remote agent

Medium Severity

When the remote agent is cached but nixl_agent_partial_meta is None, the code calls nixl_agent.remove_remote_agent(remote_name) followed by nixl_agent.add_remote_agent(remote_nixl_agent_meta) without capturing the return value. In contrast, when the agent is not cached (line 246), the return value is properly assigned to remote_name. The stale remote_name is then used in initialize_xfer, which could cause transfer failures if the actual agent name differs after re-adding.
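A minimal sketch of the fix this comment points at, assuming (as the comment implies) that add_remote_agent returns the name under which the agent was registered. FakeNixlAgent and refresh_remote_agent are hypothetical and exist only to make the example runnable.

```python
class FakeNixlAgent:
    """Hypothetical stand-in; add_remote_agent returns the registered name."""

    def __init__(self):
        self.remotes = set()

    def add_remote_agent(self, meta):
        name = meta["name"]
        self.remotes.add(name)
        return name

    def remove_remote_agent(self, name):
        self.remotes.discard(name)


def refresh_remote_agent(agent, cached_name, full_meta):
    """Drop the cached agent, re-add it from full metadata, return its name."""
    agent.remove_remote_agent(cached_name)
    # Use the name reported by add_remote_agent instead of the cached one,
    # so later calls never operate on a stale name.
    return agent.add_remote_agent(full_meta)
```

The caller would then assign the result back to remote_name before starting the transfer.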

cursor · 2026-01-30T06:41:26Z

python/ray/experimental/gpu_object_manager/nixl_tensor_transport.py

+                    nixl_agent.remove_remote_agent(evicted_agent_name)
+                except Exception as e:
+                    print(f"Warning: Failed to remove remote agent: {e}")
+            self._remote_agents[remote_name] = None


Missing lock protection for remote agent cache access

Medium Severity

The _remote_agents OrderedDict is accessed and modified without lock protection in _update_remote_agent_cache and recv_multiple_tensors. This is inconsistent with the class's pattern of using locks for other shared state (_aborted_transfer_obj_ids_lock, _nixl_memory_lock). Concurrent access could cause race conditions where an agent is evicted between the membership check and its use, or concurrent modifications corrupt the cache state.

Additional Locations (1)

python/ray/experimental/gpu_object_manager/nixl_tensor_transport.py#L236-L250
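The locking pattern described above could look roughly like the following. This is a sketch under assumptions, not the PR's code: RemoteAgentCache and touch are hypothetical names, and the real cache lives in fields of the transport class rather than in a separate object.

```python
import threading
from collections import OrderedDict


class RemoteAgentCache:
    """Hypothetical LRU set of remote agent names guarded by a lock."""

    def __init__(self, maxsize):
        self._maxsize = maxsize
        self._agents = OrderedDict()
        self._lock = threading.Lock()

    def touch(self, name):
        """Mark name as most recently used; return the names evicted."""
        evicted = []
        with self._lock:
            self._agents[name] = None
            self._agents.move_to_end(name)
            # Pop from the least recently used end until within bounds.
            while len(self._agents) > self._maxsize:
                old, _ = self._agents.popitem(last=False)
                evicted.append(old)
        return evicted

    def __contains__(self, name):
        with self._lock:
            return name in self._agents
```

Locking each method only protects the dictionary itself. To close the check-then-use race the comment mentions, the caller would still need to hold a lock across the membership check and the subsequent NIXL calls.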

cursor · 2026-01-30T06:41:26Z

python/ray/experimental/gpu_object_manager/nixl_tensor_transport.py

+                    )
+                else:
+                    nixl_agent.remove_remote_agent(remote_name)
+                    nixl_agent.add_remote_agent(remote_nixl_agent_meta)


Cache becomes stale when remove-then-add operation fails

Medium Severity

When a cached agent needs refresh (nixl_agent_partial_meta is None), the code calls remove_remote_agent followed by add_remote_agent. If remove succeeds but add fails with an exception, the agent is removed from the nixl_agent but remains in _remote_agents cache. The finally block doesn't invalidate this stale cache entry. Subsequent transfers from the same sender will hit the stale cache entry and fail when trying to operate on the non-existent agent.

Additional Locations (1)

python/ray/experimental/gpu_object_manager/nixl_tensor_transport.py#L287-L296
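One possible shape for the invalidation is sketched here, assuming the cache is a plain OrderedDict keyed by agent name as in the diff. FlakyAgent and refresh_or_invalidate are hypothetical names introduced for the example.

```python
from collections import OrderedDict


class FlakyAgent:
    """Hypothetical stand-in whose add_remote_agent always fails."""

    def remove_remote_agent(self, name):
        pass

    def add_remote_agent(self, meta):
        raise RuntimeError("add failed")


def refresh_or_invalidate(agent, cache, remote_name, full_meta):
    """Re-add a cached agent; drop its cache entry if the re-add fails."""
    agent.remove_remote_agent(remote_name)
    try:
        return agent.add_remote_agent(full_meta)
    except Exception:
        # The agent was already removed, so the cache entry is now stale.
        cache.pop(remote_name, None)
        raise
```

With the entry dropped, the next transfer from the same sender takes the uncached path and adds the agent from scratch instead of operating on a missing one.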

cursor · 2026-01-30T06:41:26Z

python/ray/experimental/gpu_object_manager/nixl_tensor_transport.py

+                try:
+                    nixl_agent.remove_remote_agent(evicted_agent_name)
+                except Exception as e:
+                    print(f"Warning: Failed to remove remote agent: {e}")


Using print() instead of logging for warnings

Low Severity

The code uses print() for warning messages instead of proper logging. The sibling file gpu_object_manager.py in the same directory correctly uses logging.getLogger(__name__) for all log output. Using print() is inconsistent with the module's logging pattern, makes it harder to filter/configure log output, and bypasses the logging infrastructure used throughout Ray.

Chong-Li · 2026-01-30T06:46:33Z

python/ray/experimental/gpu_object_manager/nixl_tensor_transport.py

+                agent_meta = nixl_agent.get_agent_metadata()
+                agent_name = nixl_agent.name
+                if self._memory_deregistered:
+                    agent_partial_meta = None


If deregister_memory was called earlier, we must not use the partial agent meta. Otherwise, when the receiver calls add_remote_agent(partial_agent_metadata), NIXL throws NIXL_ERR_NOT_ALLOWED.

This is actually a TODO in nixl (search for // TODO: Support metadata updates). When the receiver uses partial agent meta to update an existing remote agent (the sender), nixl checks the address of the newly registered memory, which may reuse the address of memory that was previously de-registered at the sender. If that address conflicts with any memory that the sender has de-registered but the receiver has not yet been told about, nixl throws NIXL_ERR_NOT_ALLOWED.

This means we benefit from agent reuse only if no de-registration happens between consecutive transfers.
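The sender-side rule can be summarized in a small sketch. choose_agent_metadata is a hypothetical helper, not a function from the PR; it only mirrors the branch shown in the diff, where the partial metadata is set to None once memory has been de-registered.

```python
def choose_agent_metadata(full_meta, partial_meta, memory_deregistered):
    """Return the (full, partial) metadata pair to ship with a transfer."""
    if memory_deregistered:
        # The receiver may still hold descriptors for memory that was
        # de-registered here, so a partial update could hit an address
        # conflict. Send only the full metadata and force a re-add.
        return full_meta, None
    return full_meta, partial_meta
```

A receiver that sees None for the partial metadata then knows it has to remove and re-add the remote agent with the full metadata.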

Chong-Li · 2026-01-30T06:49:29Z

python/ray/experimental/gpu_object_manager/nixl_tensor_transport.py

+                    nixl_agent.add_remote_agent(
+                        tensor_transport_metadata.nixl_agent_partial_meta
+                    )
+                else:


To avoid the NIXL_ERR_NOT_ALLOWED mentioned above, we have to remove the remote agent and add it again (with full agent meta) here.

[Core][RDT] Reuse nixl agent

cc8eca5

Chong-Li requested a review from a team as a code owner January 30, 2026 06:33

gemini-code-assist bot reviewed Jan 30, 2026


cursor bot reviewed Jan 30, 2026


Chong-Li commented Jan 30, 2026


Chong-Li requested review from SongGuyang, dayshah and stephanie-wang January 30, 2026 06:56

ray-gardener bot added the core (Issues that should be addressed in Ray Core) and community-contribution (Contributed by the community) labels Jan 30, 2026
