[CI] Add PPL test for generation models #24485
Conversation
Signed-off-by: wang.yuqi <noooop@126.com>
Let's make this a separate step in the CI since the Extended models test is taking quite long already
Generation models currently lack a fast, strong, and robust test for finding numerical issues that covers almost all models, the way the MTEB test does for pooling models. As you have already noted, this test may eventually cover almost all generative models.
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: wang.yuqi <noooop@126.com>
Language Models Test (PPL) ran in 17m 1s; the pytest report shows 799.89s (0:13:19). I think the test itself takes less than 5 minutes; downloading and loading the models is what is really slow. Can we optimize CI runtime through model prefetching?
After the first run, the downloaded models should be cached in the CI environment.
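For reference, if an explicit prefetch step were ever needed, it could be a short script run before pytest. This is only a minimal sketch using `huggingface_hub.snapshot_download`; the model list is illustrative, not the actual list used by the test files:

```python
# Hypothetical prefetch step run before pytest to warm the Hugging Face cache.
# The model list is illustrative; the real list would come from the PPL test files.
from huggingface_hub import snapshot_download

MODELS_TO_PREFETCH = [
    "openai-community/gpt2",
    "Qwen/Qwen2.5-0.5B-Instruct",
    "google/gemma-2-2b",
]

for repo_id in MODELS_TO_PREFETCH:
    # Downloads (or reuses) the snapshot under the HF cache dir, so the pytest
    # run only pays the load-from-disk cost rather than the network cost.
    snapshot_download(repo_id=repo_id)
```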
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: wang.yuqi <noooop@126.com>
Can we still pass with the current threshold? "The PPL difference between vLLM and Transformers is less than 1%" sounds very intuitive.
Yes
Is there anything else that needs to be modified in this PR?
Nope, LGTM
PPL test is important, thank you!!
Awesome work, thanks!
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: wang.yuqi <noooop@126.com> Signed-off-by: rogeryoungh <rogeryoungh@foxmail.com>
Signed-off-by: wang.yuqi <noooop@126.com> Signed-off-by: bruceszchen <bruceszchen@tencent.com>
Signed-off-by: wang.yuqi <noooop@126.com> Signed-off-by: bruceszchen <bruceszchen@tencent.com>
TL;DR
Assert that the PPL difference between vLLM and Transformers is less than 1%.
Purpose
This test references https://huggingface.co/docs/transformers/perplexity
The PPL test is an excellent test for verifying the implementation and finding numerical issues in generative models (vs. the logprobs test).
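For context, here is a minimal sketch of the sliding-window perplexity recipe from the referenced Transformers docs; the model name, text, and window sizes are illustrative and not taken from this PR's test files:

```python
# Sketch of the sliding-window perplexity computation described at
# https://huggingface.co/docs/transformers/perplexity (illustrative values).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai-community/gpt2"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# In the docs the text is the WikiText-2 test split; any long text works for the sketch.
text = "long evaluation text ..."
encodings = tokenizer(text, return_tensors="pt")

max_length = 1024  # GPT-2 context window
stride = 512
seq_len = encodings.input_ids.size(1)

nll_sum = 0.0
n_tokens = 0
prev_end_loc = 0
for begin_loc in range(0, seq_len, stride):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # tokens newly scored in this window
    input_ids = encodings.input_ids[:, begin_loc:end_loc]
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # mask context tokens so only new tokens count

    with torch.no_grad():
        # loss is the mean negative log-likelihood over the unmasked targets
        loss = model(input_ids, labels=target_ids).loss

    nll_sum += loss.item() * trg_len  # approximate token-weighted sum
    n_tokens += trg_len
    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = math.exp(nll_sum / n_tokens)
print(f"PPL = {ppl:.4f}")
```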
cc @DarkLight1337
Test Plan
tests/models/language/generation_ppl_test/test_gpt.py 71.29s for 2 models
tests/models/language/generation_ppl_test/test_qwen.py 51.90s for 1 model
tests/models/language/generation_ppl_test/test_gemma.py 183.75s for 3 models
Test Result
References: https://huggingface.co/docs/transformers/perplexity
Threshold
PPL_TOL = 0.01
The vLLM auto dtype and Transformers auto dtype differ by less than 1%, which is not too bad.
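As a sketch of what the tolerance check amounts to (the helper name and variables here are hypothetical, not the actual test code):

```python
# Illustrative version of the 1% tolerance check; names are hypothetical,
# not the actual helpers used in tests/models/language/generation_ppl_test/.
PPL_TOL = 0.01

def assert_ppl_close(vllm_ppl: float, hf_ppl: float, tol: float = PPL_TOL) -> None:
    # Relative difference against the Transformers reference PPL.
    rel_diff = abs(vllm_ppl - hf_ppl) / hf_ppl
    assert rel_diff < tol, (
        f"PPL mismatch: vLLM={vllm_ppl:.4f}, HF={hf_ppl:.4f}, diff={rel_diff:.2%}"
    )
```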
Known Issues
There is a significant difference between using bfloat16 and float32 for inference.
e.g. google/gemma-2-2b, 1.713% (Transformers float32 dtype vs Transformers auto dtype)
We might need to use float32 in some critical operators.
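A hypothetical way to spot-check the dtype gap with Transformers alone (model name and text are illustrative; gemma-2-2b is a gated checkpoint):

```python
# Hypothetical spot-check of the dtype effect: score the same text with the
# checkpoint's native dtype and with float32, then compare the resulting PPL.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")

ppls = {}
for name, dtype in [("auto", "auto"), ("float32", torch.float32)]:
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype)
    with torch.no_grad():
        # labels == input_ids; the model shifts them internally for next-token loss
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    ppls[name] = math.exp(loss.item())

rel_diff = abs(ppls["auto"] - ppls["float32"]) / ppls["float32"]
print(f"auto={ppls['auto']:.4f}  float32={ppls['float32']:.4f}  diff={rel_diff:.2%}")
```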