Skip to content

Conversation

@Chong-Li
Copy link
Contributor

Description

Reuse nixl remote agent to avoid cold start for each RDT.

Related issues

#60524

Additional information

@Chong-Li Chong-Li requested a review from a team as a code owner January 30, 2026 06:33
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a caching mechanism for NIXL remote agents to optimize performance by reducing cold start times for RDT operations. Key changes include adding new metadata fields to NixlTransportMetadata, implementing an LRU cache for remote agents, and modifying the extract_tensor_transport_metadata, recv_multiple_tensors, and garbage_collect methods to integrate with this cache. A new constant NIXL_REMOTE_AGENT_CACHE_MAXSIZE is also added for configuring the cache size.

Comment on lines +118 to +121
try:
nixl_agent.remove_remote_agent(evicted_agent_name)
except Exception as e:
print(f"Warning: Failed to remove remote agent: {e}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The try-except Exception block is very broad. While it prevents crashes, it might hide specific issues during remove_remote_agent. It would be more robust to catch more specific exceptions if known, or at least log the full traceback for better debugging of potential problems with agent removal.

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 4 potential issues.

)
else:
nixl_agent.remove_remote_agent(remote_name)
nixl_agent.add_remote_agent(remote_nixl_agent_meta)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return value ignored when re-adding remote agent

Medium Severity

When the remote agent is cached but nixl_agent_partial_meta is None, the code calls nixl_agent.remove_remote_agent(remote_name) followed by nixl_agent.add_remote_agent(remote_nixl_agent_meta) without capturing the return value. In contrast, when the agent is not cached (line 246), the return value is properly assigned to remote_name. The stale remote_name is then used in initialize_xfer, which could cause transfer failures if the actual agent name differs after re-adding.

Fix in Cursor Fix in Web

nixl_agent.remove_remote_agent(evicted_agent_name)
except Exception as e:
print(f"Warning: Failed to remove remote agent: {e}")
self._remote_agents[remote_name] = None
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing lock protection for remote agent cache access

Medium Severity

The _remote_agents OrderedDict is accessed and modified without lock protection in _update_remote_agent_cache and recv_multiple_tensors. This is inconsistent with the class's pattern of using locks for other shared state (_aborted_transfer_obj_ids_lock, _nixl_memory_lock). Concurrent access could cause race conditions where an agent is evicted between the membership check and its use, or concurrent modifications corrupt the cache state.

Additional Locations (1)

Fix in Cursor Fix in Web

)
else:
nixl_agent.remove_remote_agent(remote_name)
nixl_agent.add_remote_agent(remote_nixl_agent_meta)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cache becomes stale when remove-then-add operation fails

Medium Severity

When a cached agent needs refresh (nixl_agent_partial_meta is None), the code calls remove_remote_agent followed by add_remote_agent. If remove succeeds but add fails with an exception, the agent is removed from the nixl_agent but remains in _remote_agents cache. The finally block doesn't invalidate this stale cache entry. Subsequent transfers from the same sender will hit the stale cache entry and fail when trying to operate on the non-existent agent.

Additional Locations (1)

Fix in Cursor Fix in Web

try:
nixl_agent.remove_remote_agent(evicted_agent_name)
except Exception as e:
print(f"Warning: Failed to remove remote agent: {e}")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using print() instead of logging for warnings

Low Severity

The code uses print() for warning messages instead of proper logging. The sibling file gpu_object_manager.py in the same directory correctly uses logging.getLogger(__name__) for all log output. Using print() is inconsistent with the module's logging pattern, makes it harder to filter/configure log output, and bypasses the logging infrastructure used throughout Ray.

Fix in Cursor Fix in Web

agent_meta = nixl_agent.get_agent_metadata()
agent_name = nixl_agent.name
if self._memory_deregistered:
agent_partial_meta = None
Copy link
Contributor Author

@Chong-Li Chong-Li Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there was a deregister_memory called before, then we should not use the partial agent meta. Otherwise, when the receiver calls add_remote_agent(partial_agent_metadata), there would be a NIXL_ERR_NOT_ALLOWED thrown.

This is actually a TODO in nixl (search // TODO: Support metadata updates). When the receiver uses partial agent meta to update an existing remote agent (the sender), it will check the address of the newly-registered memory (which may reuse the address of a (previously) de-registered memory at the sender). If the address conflicts with any other memory (which has been de-registered at the sender but the receiver has not been notified), it throws NIXL_ERR_NOT_ALLOWED.

Copy link
Contributor Author

@Chong-Li Chong-Li Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means we can benefit from agent reuse only if there is no de-registration between consecutive transfers.

nixl_agent.add_remote_agent(
tensor_transport_metadata.nixl_agent_partial_meta
)
else:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid the NIXL_ERR_NOT_ALLOWD mentioned above, we have to remove and add remote agent (with full agent meta) here.

@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core community-contribution Contributed by the community labels Jan 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community core Issues that should be addressed in Ray Core

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant