
Allow modification of zero partitioned parameters #4192

Merged
merged 10 commits into master from olruwase/ds_3830 on Sep 2, 2023

Conversation

tjruwase
Contributor

Utilities for flexible modification of partitioned fp32 parameters and optimizer states.

Fix #3830
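
To show how these utilities fit together, here is a minimal sketch. It assumes the new setters are exported from deepspeed.utils alongside the existing getters and take the parameter, the new full tensor, and (for optimizer state) the state key; model_engine, the clamping rule, and the Adam "exp_avg" key are illustrative, not part of this PR.

# Minimal sketch (assumptions: import path and argument order mirror the existing
# getters; the clamping rule and the "exp_avg" key are only for illustration).
import torch
from deepspeed.utils import (
    safe_get_full_fp32_param,
    safe_set_full_fp32_param,
    safe_get_full_optimizer_state,
    safe_set_full_optimizer_state,
)

def clamp_weights_and_reset_momentum(model_engine, limit=1.0):
    # model_engine is assumed to be the engine returned by deepspeed.initialize
    for _, param in model_engine.module.named_parameters():
        # Gather the full fp32 copy of the ZeRO-partitioned parameter, modify it,
        # and write the result back to the owning partitions.
        full_w = safe_get_full_fp32_param(param)
        if full_w is not None:
            safe_set_full_fp32_param(param, full_w.clamp(-limit, limit))

        # Same read-modify-write pattern for a per-parameter optimizer state tensor.
        exp_avg = safe_get_full_optimizer_state(param, "exp_avg")
        if exp_avg is not None:
            safe_set_full_optimizer_state(param, torch.zeros_like(exp_avg), "exp_avg")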

Contributor

@tohtana left a comment


This is a great feature! The code modifications and the document are also clear. I do have one observation, though it's not immediately pressing:

Currently we have three get_* functions (safe_get_full_fp32_param, safe_get_full_grad, and safe_get_full_optimizer_state). This PR introduces safe_set_full_fp32_param and safe_set_full_optimizer_state. Is there a specific reason we're omitting safe_set_full_grad?
Maintaining consistency in the APIs can help users understand the design better.
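
To make the asymmetry concrete, a short sketch; the import path is assumed to match the existing getters, and safe_set_full_grad is a hypothetical name that this PR does not add:

from deepspeed.utils import safe_get_full_grad

def read_but_not_write_grad(param):
    grad = safe_get_full_grad(param)   # supported: gather the parameter's full gradient
    # safe_set_full_grad(param, grad)  # hypothetical: no write-back counterpart exists
    return grad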

deepspeed/runtime/zero/stage3.py (outdated review thread, resolved)
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
@tjruwase
Contributor Author

@tohtana, thanks for this valid question. I am delaying support for safe_set_full_grad until there is an explicit request for it, because it is harder to implement and I have limited bandwidth to think through all the design issues :(. I will add a TODO for this. Thanks for the review.

@tjruwase added this pull request to the merge queue Aug 31, 2023
@github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 31, 2023
@mrwyattii added this pull request to the merge queue Sep 1, 2023
@github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 1, 2023
@mrwyattii added this pull request to the merge queue Sep 1, 2023
Merged via the queue into master with commit a23cda6 on Sep 2, 2023
@mayank31398
Contributor

mayank31398 commented Sep 3, 2023

@tjruwase
Can we get a safe_set_gradients method?
This is required for using Megatron's sequence parallelism instead of Ulysses.
I wanted to check whether this is possible somehow; it would be super useful.

@mayank31398
Contributor

Essentially, what's needed is:

grad = get_grads(layernorm.weight)       # read the layernorm weight's full gradient
dist.all_reduce(grad, group=tp_group)    # all-reduce it across the tensor-parallel group
safe_set_grads(grad, layernorm.weight)   # write the reduced gradient back (this setter does not exist yet)

If there is an alternative way to do it, that would also be helpful.
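
One possible alternative, sketched below, is to all-reduce the layernorm gradient across the tensor-parallel group as soon as autograd produces it, via a plain PyTorch tensor hook, so that the reduced value is what ZeRO later reduces and partitions. This is not an API from this PR; whether such a hook fires early enough under every ZeRO stage (stage 3 in particular) is an assumption that would need to be verified, and tp_group is a placeholder for the user's tensor-parallel process group.

import torch.distributed as dist

def register_tp_grad_allreduce(param, tp_group):
    # Hypothetical workaround: sum this parameter's gradient across the TP group
    # during the backward pass, before DeepSpeed's own gradient reduction sees it
    # (assumed ordering, not verified for ZeRO-3).
    def _allreduce(grad):
        dist.all_reduce(grad, group=tp_group)  # in-place sum across tensor-parallel ranks
        return grad
    return param.register_hook(_allreduce)

# usage (names illustrative): register_tp_grad_allreduce(layernorm.weight, tp_group)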

@loadams deleted the olruwase/ds_3830 branch February 28, 2024 18:14
Successfully merging this pull request may close these issues.

How to modify weights during training in a deepspeed stage 3 model