[fsdp, megatron] fix: Engine Rollout Worker LoRA Parameter Update #4836
Conversation
Code Review
This pull request addresses a bug that occurs when updating vLLM LoRA weights with the FSDP Engine, which was caused by a missing layered_summon argument. The fix involves passing the layered_summon and base_sync_done arguments to the get_per_tensor_param method within the wake_up function. The change appears correct and directly solves the issue described. I have included one suggestion to improve code maintainability by using an existing class attribute.
We may need a more comprehensive review of LoRA support for both FSDP and Megatron.
```diff
  # 1. get per tensor generator from engine, this will load model to gpu
- per_tensor_param, peft_config = self.actor.engine.get_per_tensor_param()
+ per_tensor_param, peft_config = self.actor.engine.get_per_tensor_param(
```
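To make the pattern concrete, below is a minimal, runnable sketch of the call shape this suggestion points at. `ToyEngine` and `ToyWorker` are stand-ins invented for illustration; only the idea of forwarding `layered_summon` and `base_sync_done` into `get_per_tensor_param` comes from the review above, and the real verl signatures may differ.

```python
# Self-contained sketch; ToyEngine/ToyWorker are stand-ins, not verl classes.
class ToyEngine:
    def get_per_tensor_param(self, layered_summon=False, base_sync_done=False):
        # The real FSDP engine yields (name, tensor) pairs after summoning
        # shards; here we only record whether a LoRA-only export was requested.
        mode = "lora-only" if base_sync_done else "full"
        return iter([("dummy.weight", mode)]), {"peft_type": "LORA"}


class ToyWorker:
    def __init__(self, layered_summon):
        self.engine = ToyEngine()
        self.layered_summon = layered_summon

    def wake_up(self, base_sync_done):
        # The pattern from the review: forward both flags instead of calling
        # get_per_tensor_param() with no arguments.
        per_tensor_param, peft_config = self.engine.get_per_tensor_param(
            layered_summon=self.layered_summon,
            base_sync_done=base_sync_done,
        )
        return list(per_tensor_param), peft_config


print(ToyWorker(layered_summon=True).wake_up(base_sync_done=True))
```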
@JacobHelwig Megatron engine get_per_tensor_param doesn't accept layered_summon and base_sync_done, please fix it.
https://github.com/volcengine/verl/blob/main/verl/workers/engine/megatron/transformer_impl.py#L539
Good catch, thank you. I fixed it using kwargs in the Megatron engine and added a fix for LoRA with the Megatron engine (please see updated description).
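As a rough illustration of the kwargs approach mentioned in this reply (the actual Megatron engine code at the link above will differ), the idea is to let `get_per_tensor_param` tolerate extra keyword arguments so a shared caller can pass the FSDP-oriented flags without a `TypeError`:

```python
# Illustrative only: a signature that tolerates FSDP-specific keywords.
class MegatronEngineSketch:
    def get_per_tensor_param(self, **kwargs):
        # FSDP-oriented flags such as layered_summon / base_sync_done arrive
        # via kwargs and can be read or ignored without breaking the call.
        base_sync_done = kwargs.get("base_sync_done", False)
        peft_config = {"peft_type": "LORA"} if base_sync_done else None
        return iter([]), peft_config


# The shared worker-side call now succeeds for this engine as well:
MegatronEngineSketch().get_per_tensor_param(layered_summon=True, base_sync_done=True)
```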
/gemini review
Code Review
This pull request effectively addresses two critical bugs related to LoRA parameter updates for FSDP and Megatron workers. For the Megatron worker, the fix correctly distinguishes between vanilla_bridge and the standard Megatron-Bridge to call the appropriate weight export method (export_weights vs. export_hf_weights), resolving an AttributeError. For the FSDP worker, passing layered_summon and base_sync_done to get_per_tensor_param ensures that only LoRA-specific parameters are passed to vLLM, fixing a ValueError. The changes are well-targeted and correctly resolve the underlying issues. The implementation is clean and I have no further suggestions.
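For readers who want the Megatron half of the fix spelled out, here is a hedged, self-contained sketch of the branching the review describes. The stub classes and the `use_vanilla_bridge` flag are invented for illustration; only the two method names, `export_weights` and `export_hf_weights`, come from the review text.

```python
# Stub bridges illustrating why the branch matters: each bridge type exposes a
# different export method, so calling the wrong one raises AttributeError.
class VanillaBridgeStub:
    def export_weights(self, models):
        return iter([("layer.weight", None)])


class AutoBridgeStub:  # stands in for Megatron-Bridge's AutoBridge
    def export_hf_weights(self, models):
        return iter([("layer.weight", None)])


def export_per_tensor_params(bridge, models, use_vanilla_bridge):
    # Pick the exporter that actually exists on the given bridge type.
    if use_vanilla_bridge:
        return bridge.export_weights(models)
    return bridge.export_hf_weights(models)


list(export_per_tensor_params(VanillaBridgeStub(), models=[], use_vanilla_bridge=True))
list(export_per_tensor_params(AutoBridgeStub(), models=[], use_vanilla_bridge=False))
```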
What does this PR do?
Updating vLLM LoRA weights raises an error for both FSDP and Megatron workers.
FSDP: error caused by not passing the layered summon arguments to get_per_tensor_param
Megatron: error caused by using the wrong weight export method for the AutoBridge class
Test
FSDP
Script:
Error:
Megatron
Script:
Error: