[Model] Systematic support for fp32 head, generation models part #24567
Conversation
cc @22quinn I was wondering whether an fp32 head is critical for the success of RLHF training, and how we could construct a test to verify it. I'm not very familiar with RLHF, so I came up with a simple estimation method (my proposed estimation method may be complete nonsense).
cc: @houseroad @yeqcharlotte @hijkzzz Welcome to the discussion.
This pull request has merge conflicts that must be resolved before it can be merged.
Both fp32 LM head and perplexity eval were in my backlog :) Thanks a lot for this! I think there are a few aspects here:
For 2), I'm not qualified to answer this before we get more data internally. cc @zhuohan123 in case there is any prior experience.
Force-pushed from 6801a36 to e53b713
In off-policy RL, the LM head can use W16A16 (model dtype bfloat16), W16A32 (model dtype bfloat16 + out_dtype float32), or W32A32 (LM head with native float32 weights). Will this affect the RL outcome?
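For readers skimming the notation, a minimal plain-PyTorch sketch of the three modes (toy shapes; this is not the vLLM implementation) might look like:

```python
# Illustrative only: the three LM-head precision modes discussed above.
import torch

hidden = torch.randn(4, 1024, dtype=torch.bfloat16)      # last hidden states (model dtype)
w_bf16 = torch.randn(32000, 1024, dtype=torch.bfloat16)  # LM head weight in model dtype
# In a real W32A32 setup the weight comes straight from the fp32 checkpoint and was
# never rounded to bf16; it is derived from w_bf16 here only to stay self-contained.
w_fp32 = w_bf16.float()

# W16A16: bf16 weight, bf16 activations, bf16 logits
logits_w16a16 = hidden @ w_bf16.t()

# W16A32: bf16 weight, but the matmul/output run in float32 (out_dtype=float32)
logits_w16a32 = hidden.float() @ w_bf16.float().t()

# W32A32: the LM head weight itself is kept in float32
logits_w32a32 = hidden.float() @ w_fp32.t()

print(logits_w16a16.dtype, logits_w16a32.dtype, logits_w32a32.dtype)
```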
Hi, thanks for asking! Based on our experimental results, we found that an FP32 LM head is a good-to-have feature, but it does not necessarily bring a performance gain or fix the rollout-training mismatch gap. Some key observations to share:
1. MiniMax-M1 observations
2. Our findings on DAPO-32B
Therefore, while an FP32 LM head may help in some architectures (such as linear attention), its benefits appear context-dependent rather than universal. More broadly, we still need better understanding and more data to conclude whether improved perplexity directly correlates with better RL outcomes.
Thank you for sharing.
I did not find "Figure 1" in the DAPO paper. Could you please show me where I can see it?
Hi, I guess you mean the Figure 1 I showed in my previous reply? If so, it is from our blog here: https://fengyao.notion.site/off-policy-rl, where the red line in Figure 1 indicates using an FP32 LM head. :)
Thanks
Stupid question:
(By the way, Qwen 32B does not use tie_word_embeddings; this makes many things simpler.)
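As a side note on why untied weights make this simpler, here is a toy sketch (not vLLM code; module names are illustrative) of what happens when the head weight is tied to the embedding:

```python
# Why tie_word_embeddings complicates an fp32 head: the head and the token
# embedding share one weight tensor, so upcasting the head is not a local change.
import torch
import torch.nn as nn

VOCAB, HIDDEN = 1000, 64

class TiedToyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed_tokens = nn.Embedding(VOCAB, HIDDEN, dtype=torch.bfloat16)
        self.lm_head = nn.Linear(HIDDEN, VOCAB, bias=False, dtype=torch.bfloat16)
        # tie_word_embeddings=True: share the embedding weight with the head
        self.lm_head.weight = self.embed_tokens.weight

model = TiedToyLM()
model.lm_head.float()  # upcast only the head...

# ...but with PyTorch's default in-place conversion, the shared parameter's data is
# upcast too, so the embedding is no longer bf16. An fp32 head for a tied model needs
# either an fp32 embedding or a separate upcast copy of the weight just for the head.
print(model.embed_tokens.weight.dtype)  # torch.float32
```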
Force-pushed from e53b713 to c694835
Signed-off-by: wang.yuqi <noooop@126.com>
Purpose
Follow-up to #23810
"head" refers to the last Linear layer(s) of an LLM, such as the lm_head in a generation model, or the score or classifier in a classification model.
An increasing amount of evidence suggests that using an fp32 head can improve numerical precision.
[Feature]: Support casting lm_head to FP32 to get old logprobs in RLHF #19925
pooling models part PTAL #23810
Fix #19925
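To make this concrete, here is a rough, self-contained illustration (toy tensors, not a benchmark and not the vLLM code path) of how much the log-probs can drift when the final projection is rounded to bf16 versus kept in float32:

```python
# Toy comparison of log-probs with a bf16 head vs. an fp32 head.
# Shapes and values are made up; real drift depends on the model and vocab size.
import torch

torch.manual_seed(0)
hidden = torch.randn(8, 1024, dtype=torch.float32)       # final hidden states
head_w = torch.randn(32000, 1024, dtype=torch.float32)   # "head" weight (e.g. lm_head)

# bf16 path: weights and activations are rounded to bf16 before the matmul
logprobs_bf16 = (hidden.bfloat16() @ head_w.bfloat16().t()).float().log_softmax(dim=-1)

# fp32 path: the final projection stays in float32
logprobs_fp32 = (hidden @ head_w.t()).log_softmax(dim=-1)

print("max |delta log-prob|:", (logprobs_bf16 - logprobs_fp32).abs().max().item())
```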
Needs to be modified
Test Plan
Test Result
PPL test PTAL #24485
The smaller the value, the better.
Negative values mean that the vLLM result is lower than the Transformers float32 reference.
Using a float32 head is indeed better for the PPL test, sometimes even better than using float32 for all parameters.
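For anyone who wants to reproduce a float32 reference number locally, a minimal Transformers perplexity baseline could look roughly like this (illustrative only; the model name is a placeholder, and the actual test lives in #24485):

```python
# Minimal float32 perplexity baseline with Hugging Face Transformers.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B"  # placeholder model, not necessarily the one tested
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32).eval()

text = "The quick brown fox jumps over the lazy dog."
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    # Transformers shifts the labels internally; .loss is the mean per-token NLL.
    loss = model(ids, labels=ids).loss

print("ppl:", math.exp(loss.item()))
```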
Discussion
Ultimate question: does this difference really matter in RLHF?
cc @22quinn @houseroad @yeqcharlotte @hijkzzz
Essential Elements of an Effective PR Description Checklist
(If applicable) Update the documentation, including supported_models.md and examples, for a new model.