
Conversation

noooop (Contributor) commented Sep 9, 2025

TL;DR

Assert that the PPL difference between vLLM and Transformers is less than 1%.

Purpose

This test references https://huggingface.co/docs/transformers/perplexity

Perplexity (PPL) is one of the most common metrics for evaluating language models.

The PPL test is an excellent way to verify the implementation and find numerical issues in generative models (compared to the logprobs test); a minimal sketch of the measurement is shown after this list.

  • Faster: the PPL test contains only the prefill phase.
  • (Almost) no randomness: it does not include sampling, so it is not affected by different sampler implementations either.
  • Statistical strength: the PPL test yields a logprob for (almost) every token, and wikitext-2-raw-v1 has roughly 280,000+ tokens, so the variance of the result is relatively small.
  • The result is a floating-point number: this makes it possible to determine a threshold by comparing multiple models, and to compare vLLM results against a constant to speed up testing.
  • Convenient for comparing the corresponding quantized models (FP8 KV cache, ...), even when Transformers has no corresponding implementation.
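
For illustration, here is a minimal sketch of how PPL can be measured from prompt logprobs in vLLM. The model name, input text, and single-request handling are assumptions for the example, not the exact code of this test.

```python
# Minimal sketch (not the exact test code): measure PPL from vLLM prompt logprobs.
import math

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B")  # illustrative model choice

# prompt_logprobs=0 returns the logprob of each prompt token itself;
# max_tokens=1 keeps decoding (and therefore sampling) out of the measurement.
params = SamplingParams(max_tokens=1, prompt_logprobs=0, temperature=0.0)

text = "vLLM is a fast and easy-to-use library for LLM inference and serving."
(output,) = llm.generate([text], params)

# The first entry is None (there is no prediction for the first token); every
# other entry maps token_id -> Logprob for the token that appears in the prompt.
nlls = [
    -next(iter(step.values())).logprob
    for step in output.prompt_logprobs
    if step is not None
]
ppl = math.exp(sum(nlls) / len(nlls))
print(f"PPL: {ppl:.4f}")
```

The test itself aggregates negative log-likelihoods over (almost) all of the roughly 280,000 wikitext-2-raw-v1 tokens before exponentiating, so a single number summarizes the whole corpus.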

cc @DarkLight1337

Test Plan

tests/models/language/generation_ppl_test/test_gpt.py 71.29s for 2 models
tests/models/language/generation_ppl_test/test_qwen.py 51.90s for 1 model
tests/models/language/generation_ppl_test/test_gemma.py 183.75s for 3 models
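
For example, a single file can be run locally with pytest (assuming the usual vLLM test environment), e.g. `pytest -s -v tests/models/language/generation_ppl_test/test_gpt.py`.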

Test Result

Reference: https://huggingface.co/docs/transformers/perplexity, which reports:

When we run the above with stride = 1024, i.e. no overlap, the resulting PPL is 19.44, which is about the same as the 19.93 reported in the GPT-2 paper.
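
For context, here is a condensed sketch of that strided evaluation from the referenced Transformers documentation; the model choice and the stride/max_length values are illustrative, and this is not the exact code used by the test.

```python
# Condensed from https://huggingface.co/docs/transformers/perplexity (illustrative,
# not the exact code of this test).
import math

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai-community/gpt2-large"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_length = 1024  # GPT-2 context length
stride = 1024      # stride == max_length means no overlap, as in the quote above
seq_len = encodings.input_ids.size(1)

nll_sum, n_tokens, prev_end = 0.0, 0, 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end  # only tokens not seen in the previous window count
    input_ids = encodings.input_ids[:, begin:end].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # mask the overlapping context

    with torch.no_grad():
        # outputs.loss is the mean NLL over the non-masked, shifted targets
        loss = model(input_ids, labels=target_ids).loss

    n_valid = (target_ids != -100).sum().item() - 1  # one label is lost to the shift
    nll_sum += loss.item() * n_valid
    n_tokens += n_valid
    prev_end = end
    if end == seq_len:
        break

print(f"PPL: {math.exp(nll_sum / n_tokens):.2f}")
```

With stride equal to max_length there is no overlap between windows, which is the fast configuration the quote refers to; smaller strides trade speed for a tighter approximation of the fully conditional PPL.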


Model: Qwen/Qwen3-0.6B
VLLM: torch.bfloat16 23.85902976989746
Transformers: torch.bfloat16 23.859342575073242
Difference (%): -0.001311038536778668
PASSED

Model: Qwen/Qwen3-0.6B-FP8
VLLM: torch.bfloat16 24.331377029418945
Transformers: torch.bfloat16 24.322189331054688
Difference (%): 0.03777496441295651
PASSED      

Model: openai-community/gpt2-large
VLLM: torch.bfloat16 19.455724716186523
Transformers: torch.bfloat16 19.456192016601562
Difference (%): -0.002401808198851681
PASSED

Model: google/gemma-2b
VLLM: torch.bfloat16 21.491100311279297
Transformers: torch.bfloat16 21.554302215576172
Difference (%): -0.2932217599287546
PASSED


Model: google/gemma-2-2b
VLLM: torch.bfloat16 102.50028991699219
Transformers: torch.bfloat16 102.73869323730469
Difference (%): -0.23204823110007702
PASSED


Model: google/gemma-3-4b-it
VLLM: torch.bfloat16 27.99801254272461
Transformers: torch.bfloat16 27.951711654663086
Difference (%): 0.1656459848812129
PASSED

Threshold

| Model | Transformers:auto | VLLM:auto-diff | Transformers:float32-diff | VLLM:float32-diff |
| --- | --- | --- | --- | --- |
| Qwen/Qwen3-0.6B | 23.85934258 | -0.001% | -0.020% | -0.020% |
| Qwen/Qwen3-0.6B-FP8 | 24.32218933 | 0.038% | -0.040% | |
| openai-community/gpt2-large | 19.45619202 | -0.002% | -0.051% | -0.051% |
| google/gemma-2b | 21.55430222 | -0.293% | -0.194% | -0.161% |
| google/gemma-2-2b | 102.7386932 | -0.232% | -1.713% | |
| google/gemma-3-4b-it | 27.95171165 | 0.166% | -0.340% | |

PPL_TOL = 0.01

The vLLM auto-dtype and Transformers auto-dtype results differ by less than 1%, which is not too bad.
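
Concretely, the check amounts to something like this (names here are illustrative, not necessarily the exact test code; the values come from the results above):

```python
# Illustrative form of the tolerance check.
PPL_TOL = 0.01  # 1% relative tolerance

def assert_ppl_close(vllm_ppl: float, hf_ppl: float, tol: float = PPL_TOL) -> None:
    # Relative difference between vLLM and Transformers perplexities must stay within tol.
    assert abs(vllm_ppl - hf_ppl) / hf_ppl < tol

assert_ppl_close(23.85902976989746, 23.859342575073242)  # Qwen/Qwen3-0.6B, passes
```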

Known Issues

There is a significant difference between using bfloat16 and float32 for inference,
e.g. 1.713% for google/gemma-2-2b (Transformers float32 dtype vs Transformers auto dtype).

We might need to use float32 in some critical operators; a hypothetical illustration follows.
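
As a hypothetical illustration only (not taken from the vLLM or Transformers code), upcasting a numerically sensitive operator such as RMSNorm to float32 would look roughly like this:

```python
# Hypothetical sketch of running a sensitive operator in float32; not vLLM's actual code.
import torch
import torch.nn as nn

class RMSNormFP32(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        orig_dtype = x.dtype
        xf = x.float()  # do the reduction in float32 to limit bfloat16 rounding error
        xf = xf * torch.rsqrt(xf.pow(2).mean(-1, keepdim=True) + self.eps)
        return (self.weight.float() * xf).to(orig_dtype)  # cast back to the working dtype
```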



mergify bot added the qwen (Related to Qwen models) label on Sep 9, 2025
DarkLight1337 (Member) left a comment:
Let's make this a separate step in the CI since the Extended models test is taking quite long already

noooop (Contributor, Author) commented Sep 9, 2025

Let's make this a separate step in the CI since the Extended models test is taking quite long already

Generation models currently lack a fast, strong, robust test that finds numerical issues and covers almost all models, the way the MTEB test does for pooling models.

You have probably already noticed that this test may eventually cover almost all generative models.

mergify bot added the ci/build label on Sep 9, 2025
noooop marked this pull request as ready for review September 9, 2025 08:26
noooop requested a review from ywang96 as a code owner September 9, 2025 08:26
noooop (Contributor, Author) commented Sep 9, 2025

The Language Models Test (PPL) step ran in 17m 1s; the pytest report shows 799.89s (0:13:19).

I think the tests themselves take less than 5 minutes; downloading and loading the models is really slow.


Can we optimize CI runtime speed through model prefetching?

DarkLight1337 (Member) replied:

Can we optimize CI runtime speed through model prefetching?

After the first run, the downloaded models should be cached in the CI environment.

noooop (Contributor, Author) commented Sep 10, 2025

@DarkLight1337

Can we still pass with the current threshold?


The PPL difference between vLLM and Transformers is less than 1%.

Sounds very intuitive.

DarkLight1337 (Member) replied:

Yes

noooop changed the title from "[CI] Add PPL test for generative models" to "[CI] Add PPL test for generation models" on Sep 10, 2025
noooop (Contributor, Author) commented Sep 10, 2025

@DarkLight1337

Is there anything else that needs to be modified in this PR?

DarkLight1337 added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Sep 10, 2025
DarkLight1337 (Member) left a comment:
Nope, LGTM

DarkLight1337 enabled auto-merge (squash) September 10, 2025 11:31
vllm-bot merged commit bd98842 into vllm-project:main Sep 10, 2025
81 of 86 checks passed
noooop deleted the ppl branch September 10, 2025 14:03
22quinn (Collaborator) commented Sep 11, 2025

PPL test is important, thank you!!

22quinn added the rl (Related to RL workflows) label on Sep 11, 2025
maxdebayser (Contributor) commented:
Awesome work, thanks!

skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
rogeryoungh pushed a commit to MiniMax-AI/vllm that referenced this pull request Sep 15, 2025
cboss6 pushed a commit to cboss6/vllm that referenced this pull request Sep 16, 2025