Fix gemma2 accuracy through the correct softcapping logic #2842
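For reference, Gemma2's soft capping squashes logits with a `tanh` instead of leaving them unbounded or hard-clipping them. A minimal PyTorch sketch of the idea; the `softcap` helper name and the sample tensor are illustrative, and the cap values are the ones published in the Gemma2 config (`attn_logit_softcapping=50.0`, `final_logit_softcapping=30.0`):

```python
import torch

def softcap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Gemma2-style soft capping: maps logits smoothly into (-cap, cap)
    # via tanh, keeping gradients well-behaved unlike a hard clamp.
    return cap * torch.tanh(logits / cap)

# Gemma2 applies this in two places:
#   attention scores (cap = 50.0), before the softmax
#   final LM-head logits (cap = 30.0), before the loss / sampling
attn_scores = torch.randn(1, 8, 16, 16)  # illustrative shape
attn_scores = softcap(attn_scores, cap=50.0)
```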
NOTE
There is still one other place that is not aligned with Gemma2, though I don't think it matters much: the RMSNorm implementation.
Llama computes `x.to(float16) * w`, whereas Gemma computes `(x * w).to(float16)`.
See huggingface/transformers#29402.
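A minimal PyTorch sketch of that ordering difference, assuming float16 activations; the function names are illustrative, not the actual implementations:

```python
import torch

def rmsnorm_llama(x: torch.Tensor, w: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Llama order: normalize in float32, cast back to the input dtype,
    # and only then multiply by the weight (in float16).
    dtype = x.dtype
    x = x.float()
    x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)
    return x.to(dtype) * w

def rmsnorm_gemma(x: torch.Tensor, w: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Gemma order: multiply by the weight while still in float32 and
    # cast the *product* back, so the scaling happens at full precision.
    dtype = x.dtype
    x = x.float()
    x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)
    return (x * w.float()).to(dtype)
```

The two differ only in where the float16 cast lands relative to the weight multiply, which is exactly the `x.to(float16) * w` vs. `(x * w).to(float16)` distinction noted above; the rounding difference is small but can show up in accuracy comparisons.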