-
Notifications
You must be signed in to change notification settings - Fork 7.2k
[WIP][Core][RDT] Reuse nixl agent #60602
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a caching mechanism for NIXL remote agents to optimize performance by reducing cold start times for RDT operations. Key changes include adding new metadata fields to NixlTransportMetadata, implementing an LRU cache for remote agents, and modifying the extract_tensor_transport_metadata, recv_multiple_tensors, and garbage_collect methods to integrate with this cache. A new constant NIXL_REMOTE_AGENT_CACHE_MAXSIZE is also added for configuring the cache size.
| try: | ||
| nixl_agent.remove_remote_agent(evicted_agent_name) | ||
| except Exception as e: | ||
| print(f"Warning: Failed to remove remote agent: {e}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cursor Bugbot has reviewed your changes and found 4 potential issues.
| ) | ||
| else: | ||
| nixl_agent.remove_remote_agent(remote_name) | ||
| nixl_agent.add_remote_agent(remote_nixl_agent_meta) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Return value ignored when re-adding remote agent
Medium Severity
When the remote agent is cached but nixl_agent_partial_meta is None, the code calls nixl_agent.remove_remote_agent(remote_name) followed by nixl_agent.add_remote_agent(remote_nixl_agent_meta) without capturing the return value. In contrast, when the agent is not cached (line 246), the return value is properly assigned to remote_name. The stale remote_name is then used in initialize_xfer, which could cause transfer failures if the actual agent name differs after re-adding.
| nixl_agent.remove_remote_agent(evicted_agent_name) | ||
| except Exception as e: | ||
| print(f"Warning: Failed to remove remote agent: {e}") | ||
| self._remote_agents[remote_name] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing lock protection for remote agent cache access
Medium Severity
The _remote_agents OrderedDict is accessed and modified without lock protection in _update_remote_agent_cache and recv_multiple_tensors. This is inconsistent with the class's pattern of using locks for other shared state (_aborted_transfer_obj_ids_lock, _nixl_memory_lock). Concurrent access could cause race conditions where an agent is evicted between the membership check and its use, or concurrent modifications corrupt the cache state.
Additional Locations (1)
| ) | ||
| else: | ||
| nixl_agent.remove_remote_agent(remote_name) | ||
| nixl_agent.add_remote_agent(remote_nixl_agent_meta) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cache becomes stale when remove-then-add operation fails
Medium Severity
When a cached agent needs refresh (nixl_agent_partial_meta is None), the code calls remove_remote_agent followed by add_remote_agent. If remove succeeds but add fails with an exception, the agent is removed from the nixl_agent but remains in _remote_agents cache. The finally block doesn't invalidate this stale cache entry. Subsequent transfers from the same sender will hit the stale cache entry and fail when trying to operate on the non-existent agent.
Additional Locations (1)
| try: | ||
| nixl_agent.remove_remote_agent(evicted_agent_name) | ||
| except Exception as e: | ||
| print(f"Warning: Failed to remove remote agent: {e}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using print() instead of logging for warnings
Low Severity
The code uses print() for warning messages instead of proper logging. The sibling file gpu_object_manager.py in the same directory correctly uses logging.getLogger(__name__) for all log output. Using print() is inconsistent with the module's logging pattern, makes it harder to filter/configure log output, and bypasses the logging infrastructure used throughout Ray.
| agent_meta = nixl_agent.get_agent_metadata() | ||
| agent_name = nixl_agent.name | ||
| if self._memory_deregistered: | ||
| agent_partial_meta = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there was a deregister_memory called before, then we should not use the partial agent meta. Otherwise, when the receiver calls add_remote_agent(partial_agent_metadata), there would be a NIXL_ERR_NOT_ALLOWED thrown.
This is actually a TODO in nixl (search // TODO: Support metadata updates). When the receiver uses partial agent meta to update an existing remote agent (the sender), it will check the address of the newly-registered memory (which may reuse the address of a (previously) de-registered memory at the sender). If the address conflicts with any other memory (which has been de-registered at the sender but the receiver has not been notified), it throws NIXL_ERR_NOT_ALLOWED.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means we can benefit from agent reuse only if there is no de-registration between consecutive transfers.
| nixl_agent.add_remote_agent( | ||
| tensor_transport_metadata.nixl_agent_partial_meta | ||
| ) | ||
| else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid the NIXL_ERR_NOT_ALLOWD mentioned above, we have to remove and add remote agent (with full agent meta) here.


Description
Reuse nixl remote agent to avoid cold start for each RDT.
Related issues
#60524
Additional information