
[Bugfix][1/n] Fix the speculative decoding test by setting the target dtype #19633

Merged: 1 commit, Jun 14, 2025
24 changes: 24 additions & 0 deletions tests/spec_decode/e2e/test_multistep_correctness.py
@@ -57,6 +57,9 @@

# Skip cuda graph recording for fast test.
"enforce_eager": True,

# The original model is float32, keep it for numerical stability.
"dtype": "float32",
Comment on lines +61 to +62 (Contributor, severity: medium):

This dtype setting and its accompanying comment are repeated in several places throughout this file. To enhance maintainability and adhere to the DRY (Don't Repeat Yourself) principle, consider defining these common keyword arguments in a shared dictionary at the module level. This dictionary can then be spread into each specific common_llm_kwargs dictionary where this setting is needed.

For instance, you could define a constant like this at the module scope:

_COMMON_FLOAT32_KWARGS = {
    # The original model is float32, keep it for numerical stability.
    "dtype": "float32",
}

And then utilize it within the parametrize decorator as shown in the suggestion below. This approach centralizes the configuration, making future updates simpler and more consistent.

Suggested change:
-    # The original model is float32, keep it for numerical stability.
-    "dtype": "float32",
+    **_COMMON_FLOAT32_KWARGS,
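
To make the pattern concrete, here is a minimal, self-contained sketch of how the module-level constant would be spread into a parametrize decorator; the test function name and the assertion are illustrative assumptions, not code from this PR:

    import pytest

    # Shared target-model kwargs, defined once at module scope.
    _COMMON_FLOAT32_KWARGS = {
        # The original model is float32, keep it for numerical stability.
        "dtype": "float32",
    }

    @pytest.mark.parametrize(
        "common_llm_kwargs",
        [{
            # Skip cuda graph recording for fast test.
            "enforce_eager": True,

            # Spread the shared dtype setting instead of repeating it.
            **_COMMON_FLOAT32_KWARGS,
        }])
    def test_target_dtype_is_float32(common_llm_kwargs):
        # Illustrative check only: the spread yields a plain dict entry.
        assert common_llm_kwargs["dtype"] == "float32"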

}])
@pytest.mark.parametrize(
"per_test_common_llm_kwargs",
@@ -139,6 +142,9 @@ def test_spec_decode_e2e_with_detokenization(test_llm_generator,

# Print spec metrics.
"disable_log_stats": False,

# The original model is float32, keep it for numerical stability.
"dtype": "float32",
Comment on lines +146 to +147 (Contributor, severity: medium):

Similar to the previous comment, this dtype configuration is repeated. Applying the shared dictionary pattern (e.g., using **_COMMON_FLOAT32_KWARGS) here would improve maintainability by centralizing this common setting.

Suggested change:
-    # The original model is float32, keep it for numerical stability.
-    "dtype": "float32",
+    **_COMMON_FLOAT32_KWARGS,

}])
@pytest.mark.parametrize(
"per_test_common_llm_kwargs",
@@ -216,6 +222,9 @@ def test_spec_decode_e2e_greedy_correctness_tiny_model_bs1(

# Print spec metrics.
"disable_log_stats": False,

# The original model is float32, keep it for numerical stability.
"dtype": "float32",
Comment on lines +226 to +227 (Contributor, severity: medium):

This is another instance of the repeated dtype configuration. Using a shared dictionary (e.g., **_COMMON_FLOAT32_KWARGS) would be beneficial for maintainability here as well.

Suggested change:
-    # The original model is float32, keep it for numerical stability.
-    "dtype": "float32",
+    **_COMMON_FLOAT32_KWARGS,

}])
@pytest.mark.parametrize(
"per_test_common_llm_kwargs",
@@ -279,6 +288,9 @@ def test_spec_decode_e2e_greedy_correctness_tiny_model_large_bs(
[{
# Skip cuda graph recording for fast test.
"enforce_eager": True,

# The original model is float32, keep it for numerical stability.
"dtype": "float32",
}])
@pytest.mark.parametrize(
"per_test_common_llm_kwargs",
@@ -464,6 +476,8 @@ def test_spec_decode_e2e_greedy_correctness_real_model_large_bs(

# Skip cuda graph recording for fast test.
"enforce_eager": True,
# The original model is float32, keep it for numerical stability.
"dtype": "float32",
Comment on lines +479 to +480 (Contributor, severity: medium):

The dtype configuration is repeated here. Consider using the suggested shared dictionary pattern (e.g., **_COMMON_FLOAT32_KWARGS) to centralize this setting for better maintainability.

Suggested change:
-    # The original model is float32, keep it for numerical stability.
-    "dtype": "float32",
+    **_COMMON_FLOAT32_KWARGS,

}])
@pytest.mark.parametrize("per_test_common_llm_kwargs", [
{
@@ -523,6 +537,8 @@ def test_spec_decode_e2e_greedy_correctness_with_preemption(

# Skip cuda graph recording for fast test.
"enforce_eager": True,
# The original model is float32, keep it for numerical stability.
"dtype": "float32",
Comment on lines +540 to +541 (Contributor, severity: medium):

This dtype setting is repeated. Applying the shared dictionary pattern (e.g., **_COMMON_FLOAT32_KWARGS) would improve code maintainability.

Suggested change:
-    # The original model is float32, keep it for numerical stability.
-    "dtype": "float32",
+    **_COMMON_FLOAT32_KWARGS,

}])
@pytest.mark.parametrize(
"per_test_common_llm_kwargs",
@@ -589,6 +605,8 @@ def test_spec_decode_different_block_size(vllm_runner, common_llm_kwargs,

# Skip cuda graph recording for fast test.
"enforce_eager": True,
# The original model is float32, keep it for numerical stability.
"dtype": "float32",
Comment on lines +608 to +609 (Contributor, severity: medium):

Again, the dtype configuration is duplicated. Using a shared dictionary (e.g., **_COMMON_FLOAT32_KWARGS) would help in centralizing this common parameter.

Suggested change:
-    # The original model is float32, keep it for numerical stability.
-    "dtype": "float32",
+    **_COMMON_FLOAT32_KWARGS,

}])
@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}])
@pytest.mark.parametrize("baseline_llm_kwargs", [{}])
@@ -655,6 +673,8 @@ def test_skip_speculation(vllm_runner, common_llm_kwargs,

# Skip cuda graph recording for fast test.
"enforce_eager": True,
# The original model is float32, keep it for numerical stability.
"dtype": "float32",
Comment on lines +676 to +677 (Contributor, severity: medium):

This dtype setting is repeated. Employing the shared dictionary pattern (e.g., **_COMMON_FLOAT32_KWARGS) would enhance maintainability.

Suggested change:
-    # The original model is float32, keep it for numerical stability.
-    "dtype": "float32",
+    **_COMMON_FLOAT32_KWARGS,

}])
@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}])
@pytest.mark.parametrize("baseline_llm_kwargs", [{}])
@@ -706,6 +726,8 @@ def test_disable_speculation(vllm_runner, common_llm_kwargs,

# Skip cuda graph recording for fast test.
"enforce_eager": True,
# The original model is float32, keep it for numerical stability.
"dtype": "float32",
Comment on lines +729 to +730 (Contributor, severity: medium):

The dtype configuration appears again. Centralizing this with a shared dictionary (e.g., **_COMMON_FLOAT32_KWARGS) is recommended for better maintainability.

Suggested change:
-    # The original model is float32, keep it for numerical stability.
-    "dtype": "float32",
+    **_COMMON_FLOAT32_KWARGS,

}])
@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}])
@pytest.mark.parametrize("baseline_llm_kwargs", [{}])
@@ -763,6 +785,8 @@ def test_many_k(vllm_runner, common_llm_kwargs, per_test_common_llm_kwargs,

# Skip cuda graph recording for fast test.
"enforce_eager": True,
# The original model is float32, keep it for numerical stability.
"dtype": "float32",
Comment on lines +788 to +789 (Contributor, severity: medium):

This is the final instance of the repeated dtype configuration in this PR. Using a shared dictionary (e.g., **_COMMON_FLOAT32_KWARGS) would be beneficial for maintainability.

Suggested change:
-    # The original model is float32, keep it for numerical stability.
-    "dtype": "float32",
+    **_COMMON_FLOAT32_KWARGS,

}])
@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}])
@pytest.mark.parametrize("baseline_llm_kwargs", [{}])