@st-bang97

Description

This PR addresses Issue #7819.

With ZeRO Stage-3 and a CPU-offloaded optimizer (CPUAdam) split into two or more subgroups, ds_adam_step can be invoked multiple times within a single global optimizer step. Before this fix, the internal bias-correction state (e.g., _betta2_t_bias_correction2) could become inconsistent across subgroup invocations within the same step, so optimizer state diverged from one subgroup to the next.
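For context, the textbook Adam update divides the moment estimates by bias-correction terms that depend only on the step count t. The symbols below follow the usual Adam notation and are an assumption for illustration; they are not taken from the DeepSpeed source:

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^{t}}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^{t}}, \qquad
\theta_t = \theta_{t-1} - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$

Because $\beta_1^{t}$ and $\beta_2^{t}$ are functions of t alone, every subgroup processed within the same global step should see identical values for them.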


Solution

Make IncrementStep() step-consistent under repeated calls in the same global step.

Change

Update IncrementStep() so that it advances or recomputes state only when step != _step, preventing subgroup-to-subgroup drift inside a single step.

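inline void IncrementStep(size_t step, float beta1, float beta2)
{
    if (beta1 != _betta1 || beta2 != _betta2) {
        // Betas changed: reset the cached powers from scratch.
        _step = step;
        _betta1 = beta1;
        _betta2 = beta2;
        _betta1_t = std::pow(_betta1, step);
        _betta2_t = std::pow(_betta2, step);
    } else {
        // Repeated call within the same global step: leave state untouched.
        if (step != _step) {
            _step++;
            if (_step != step) {
                // Step jumped by more than one: recompute the powers.
                _betta1_t = std::pow(_betta1, step);
                _betta2_t = std::pow(_betta2, step);
                _step = step;
            } else {
                // Normal single-step advance: update the powers incrementally.
                _betta1_t *= _betta1;
                _betta2_t *= _betta2;
            }
        }
    }
}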
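The effect of the guard can be checked in isolation with a small harness. The sketch below is a minimal stand-in, not DeepSpeed code: `StepState` and `step_is_stable` are hypothetical names introduced for illustration, the default beta values are assumptions, and only the members touched by IncrementStep() are modelled. The function body is copied from the change above.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for the optimizer's step-tracking state.
// Default values are assumptions chosen for the example.
struct StepState {
    std::size_t _step = 0;
    float _betta1 = 0.9f;
    float _betta2 = 0.999f;
    float _betta1_t = 1.0f;  // beta1 ^ _step
    float _betta2_t = 1.0f;  // beta2 ^ _step

    // Body copied from the change in this PR.
    inline void IncrementStep(std::size_t step, float beta1, float beta2)
    {
        if (beta1 != _betta1 || beta2 != _betta2) {
            _step = step;
            _betta1 = beta1;
            _betta2 = beta2;
            _betta1_t = std::pow(_betta1, step);
            _betta2_t = std::pow(_betta2, step);
        } else {
            if (step != _step) {
                _step++;
                if (_step != step) {
                    _betta1_t = std::pow(_betta1, step);
                    _betta2_t = std::pow(_betta2, step);
                    _step = step;
                } else {
                    _betta1_t *= _betta1;
                    _betta2_t *= _betta2;
                }
            }
        }
    }
};

// Simulates one global step split across `subgroups` calls and reports
// whether every call after the first left the cached state unchanged.
inline bool step_is_stable(StepState& s, std::size_t step, int subgroups)
{
    const float beta1 = s._betta1;
    const float beta2 = s._betta2;

    s.IncrementStep(step, beta1, beta2);  // first subgroup advances the state
    const float b1_t = s._betta1_t;
    const float b2_t = s._betta2_t;

    for (int i = 1; i < subgroups; ++i) {
        s.IncrementStep(step, beta1, beta2);  // later subgroups, same step
        if (s._betta1_t != b1_t || s._betta2_t != b2_t || s._step != step) {
            return false;
        }
    }
    return true;
}
```

Calling `step_is_stable(s, 1, 4)` on a default-constructed `StepState` models four subgroups sharing global step 1; with the guard in place, only the first call changes `_betta1_t` and `_betta2_t`.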
  1. Reproduction (Before Fix)
Screenshot 2026-01-27 194552 Screenshot 2026-01-27 194716
  2. Verification (After Fix)
Screenshot 2026-01-27 200422 Screenshot 2026-01-27 200410

@st-bang97 requested a review from tjruwase as a code owner on January 27, 2026, 11:19