
Conversation

noooop (Contributor) commented Sep 9, 2025

TL;DR

Assert that the PPL difference between vLLM and Transformers is less than 1%.

Purpose

This test references https://huggingface.co/docs/transformers/perplexity

Perplexity (PPL) is one of the most common metrics for evaluating language models.

The PPL test is an excellent way to verify the implementation and find numerical issues in generative models (compared to the logprobs test); a minimal sketch of the measurement is shown after this list.

  • Faster: the PPL test contains only the prefill phase.
  • (Almost) no randomness: it does not include sampling, so it is not affected by different sampler implementations either.
  • Statistical strength: the PPL test yields a logprob for (almost) every token, and wikitext-2-raw-v1 has roughly 280,000+ tokens, so the variance of the result is relatively small.
  • The result is a floating-point number: this makes it possible to determine a threshold by comparing multiple models, and to compare vLLM results against a constant to speed up testing.
  • Convenient for comparing the corresponding quantized models (FP8 KV cache, ...), even when Transformers has no corresponding implementation.
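
For illustration, here is a minimal sketch of how PPL can be measured from prompt logprobs in vLLM. The model name, input text, and single-request handling are assumptions for the example, not the exact code of this test.

```python
# Minimal sketch (not the exact test code): measure PPL from vLLM prompt logprobs.
import math

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B")  # illustrative model choice

# prompt_logprobs=0 returns the logprob of each prompt token itself;
# max_tokens=1 keeps decoding (and therefore sampling) out of the measurement.
params = SamplingParams(max_tokens=1, prompt_logprobs=0, temperature=0.0)

text = "vLLM is a fast and easy-to-use library for LLM inference and serving."
(output,) = llm.generate([text], params)

# The first entry is None (there is no prediction for the first token); every
# other entry maps token_id -> Logprob for the token that appears in the prompt.
nlls = [
    -next(iter(step.values())).logprob
    for step in output.prompt_logprobs
    if step is not None
]
ppl = math.exp(sum(nlls) / len(nlls))
print(f"PPL: {ppl:.4f}")
```

The test itself aggregates negative log-likelihoods over (almost) all of the roughly 280,000 wikitext-2-raw-v1 tokens before exponentiating, so a single number summarizes the whole corpus.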

cc @DarkLight1337

Test Plan

tests/models/language/generation_ppl_test/test_gpt.py 71.29s for 2 models
tests/models/language/generation_ppl_test/test_qwen.py 51.90s for 1 model
tests/models/language/generation_ppl_test/test_gemma.py 183.75s for 3 models
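
For example, a single file can be run locally with pytest (assuming the usual vLLM test environment), e.g. `pytest -s -v tests/models/language/generation_ppl_test/test_gpt.py`.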

Test Result

Reference: https://huggingface.co/docs/transformers/perplexity, which reports:

When we run the above with stride = 1024, i.e. no overlap, the resulting PPL is 19.44, which is about the same as the 19.93 reported in the GPT-2 paper.
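
For context, here is a condensed sketch of that strided evaluation from the referenced Transformers documentation; the model choice and the stride/max_length values are illustrative, and this is not the exact code used by the test.

```python
# Condensed from https://huggingface.co/docs/transformers/perplexity (illustrative,
# not the exact code of this test).
import math

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai-community/gpt2-large"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_length = 1024  # GPT-2 context length
stride = 1024      # stride == max_length means no overlap, as in the quote above
seq_len = encodings.input_ids.size(1)

nll_sum, n_tokens, prev_end = 0.0, 0, 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end  # only tokens not seen in the previous window count
    input_ids = encodings.input_ids[:, begin:end].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # mask the overlapping context

    with torch.no_grad():
        # outputs.loss is the mean NLL over the non-masked, shifted targets
        loss = model(input_ids, labels=target_ids).loss

    n_valid = (target_ids != -100).sum().item() - 1  # one label is lost to the shift
    nll_sum += loss.item() * n_valid
    n_tokens += n_valid
    prev_end = end
    if end == seq_len:
        break

print(f"PPL: {math.exp(nll_sum / n_tokens):.2f}")
```

With stride equal to max_length there is no overlap between windows, which is the fast configuration the quote refers to; smaller strides trade speed for a tighter approximation of the fully conditional PPL.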


Model: Qwen/Qwen3-0.6B
VLLM: torch.bfloat16 23.85902976989746
Transformers: torch.bfloat16 23.859342575073242
Difference (%): -0.001311038536778668
PASSED

Model: Qwen/Qwen3-0.6B-FP8
VLLM: torch.bfloat16 24.331377029418945
Transformers: torch.bfloat16 24.322189331054688
Difference (%): 0.03777496441295651
PASSED      

Model: openai-community/gpt2-large
VLLM: torch.bfloat16 19.455724716186523
Transformers: torch.bfloat16 19.456192016601562
Difference (%): -0.002401808198851681
PASSED

Model: google/gemma-2b
VLLM: torch.bfloat16 21.491100311279297
Transformers: torch.bfloat16 21.554302215576172
Difference (%): -0.2932217599287546
PASSED


Model: google/gemma-2-2b
VLLM: torch.bfloat16 102.50028991699219
Transformers: torch.bfloat16 102.73869323730469
Difference (%): -0.23204823110007702
PASSED


Model: google/gemma-3-4b-it
VLLM: torch.bfloat16 27.99801254272461
Transformers: torch.bfloat16 27.951711654663086
Difference (%): 0.1656459848812129
PASSED

Threshold

| Model | Transformers:auto | VLLM:auto-diff | Transformers:float32-diff | VLLM:float32-diff |
| --- | --- | --- | --- | --- |
| Qwen/Qwen3-0.6B | 23.85934258 | -0.001% | -0.020% | -0.020% |
| Qwen/Qwen3-0.6B-FP8 | 24.32218933 | 0.038% | -0.040% | |
| openai-community/gpt2-large | 19.45619202 | -0.002% | -0.051% | -0.051% |
| google/gemma-2b | 21.55430222 | -0.293% | -0.194% | -0.161% |
| google/gemma-2-2b | 102.7386932 | -0.232% | -1.713% | |
| google/gemma-3-4b-it | 27.95171165 | 0.166% | -0.340% | |

PPL_TOL = 0.01

The vLLM auto-dtype and Transformers auto-dtype results differ by less than 1%, which is not too bad.
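
Concretely, the check amounts to something like this (names here are illustrative, not necessarily the exact test code; the values come from the results above):

```python
# Illustrative form of the tolerance check.
PPL_TOL = 0.01  # 1% relative tolerance

def assert_ppl_close(vllm_ppl: float, hf_ppl: float, tol: float = PPL_TOL) -> None:
    # Relative difference between vLLM and Transformers perplexities must stay within tol.
    assert abs(vllm_ppl - hf_ppl) / hf_ppl < tol

assert_ppl_close(23.85902976989746, 23.859342575073242)  # Qwen/Qwen3-0.6B, passes
```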

Known Issues

There is a significant difference between using bfloat16 and float32 for inference,
e.g. 1.713% for google/gemma-2-2b (Transformers float32 dtype vs Transformers auto dtype).

We might need to use float32 in some critical operators; a hypothetical illustration follows.
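
As a hypothetical illustration only (not taken from the vLLM or Transformers code), upcasting a numerically sensitive operator such as RMSNorm to float32 would look roughly like this:

```python
# Hypothetical sketch of running a sensitive operator in float32; not vLLM's actual code.
import torch
import torch.nn as nn

class RMSNormFP32(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        orig_dtype = x.dtype
        xf = x.float()  # do the reduction in float32 to limit bfloat16 rounding error
        xf = xf * torch.rsqrt(xf.pow(2).mean(-1, keepdim=True) + self.eps)
        return (self.weight.float() * xf).to(orig_dtype)  # cast back to the working dtype
```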



mergify bot added the qwen (Related to Qwen models) label on Sep 9, 2025
DarkLight1337 (Member) left a comment:
Let's make this a separate step in the CI since the Extended models test is taking quite long already

noooop (Contributor, Author) commented Sep 9, 2025

Let's make this a separate step in the CI since the Extended models test is taking quite long already

Generation models currently lack a fast, strong, robust test that finds numerical issues and covers almost all models, the way the MTEB test does for pooling models.

You have probably already noticed that this test may eventually cover almost all generative models.

mergify bot added the ci/build label on Sep 9, 2025
noooop marked this pull request as ready for review September 9, 2025 08:26
noooop requested a review from ywang96 as a code owner September 9, 2025 08:26
noooop (Contributor, Author) commented Sep 9, 2025

The Language Models Test (PPL) step ran in 17m 1s; the pytest report shows 799.89s (0:13:19).

I think the tests themselves take less than 5 minutes; downloading and loading the models is really slow.


Can we optimize CI runtime speed through model prefetching?

DarkLight1337 (Member) replied:

Can we optimize CI runtime speed through model prefetching?

After the first run, the downloaded models should be cached in the CI environment.

noooop (Contributor, Author) commented Sep 10, 2025

@DarkLight1337

Can we still pass with the current threshold?


The PPL difference between vLLM and Transformers is less than 1%.

Sounds very intuitive.

DarkLight1337 (Member) replied:

Yes

noooop changed the title from "[CI] Add PPL test for generative models" to "[CI] Add PPL test for generation models" on Sep 10, 2025
noooop (Contributor, Author) commented Sep 10, 2025

@DarkLight1337

Is there anything else that needs to be modified in this PR?

DarkLight1337 added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Sep 10, 2025
DarkLight1337 (Member) left a comment:
Nope, LGTM

DarkLight1337 enabled auto-merge (squash) September 10, 2025 11:31
vllm-bot merged commit bd98842 into vllm-project:main Sep 10, 2025
81 of 86 checks passed
noooop deleted the ppl branch September 10, 2025 14:03
22quinn (Collaborator) commented Sep 11, 2025

PPL test is important, thank you!!

22quinn added the rl (Related to RL workflows) label on Sep 11, 2025
maxdebayser (Contributor) commented:
Awesome work, thanks!

skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
rogeryoungh pushed a commit to MiniMax-AI/vllm that referenced this pull request Sep 15, 2025
cboss6 pushed a commit to cboss6/vllm that referenced this pull request Sep 16, 2025