[Cherry-Pick][OP][Feature] 统一 limit_thinking_content_length CUDA 算子，支持回复长度限制与注入序列 by yuanlehome · Pull Request #6511 · PaddlePaddle/FastDeploy

yuanlehome · 2026-02-26T02:18:24Z

Motivation

将 develop 上 PR #6493 的核心能力迁移到 release/2.5：统一 thinking 截断 CUDA 算子，并引入 response_max_tokens 端到端支持。该迁移同时对 release 分支现有目录与调用差异做了对齐，保持行为与上游改动一致。

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

💡 如若此PR是Cherry Pick，PR标题需遵循格式，在最开始加上[Cherry-Pick]标签，以及最后面加上原PR ID，例如[Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

算子统一（OP / SpecDecode）
- 合并 limit_thinking_content_length_v1/v2 为统一 GPU 算子 limit_thinking_content_length。
- 合并 speculate_limit_thinking_content_length_v1/v2 为统一推测解码算子 speculate_limit_thinking_content_length。
- 更新 cpp_extensions.cc 导出符号与 setup_ops.py 编译源，移除旧 v1/v2 文件。
推理后处理链路改造（Executor / Worker）
- pre_and_post_process.py 移除按 limit_strategy 分发的 Python 包装逻辑，直接调用统一算子。
- gpu_model_runner.py 新增并维护：
  - max_reply_lens
  - inject_token_ids
  - splitwise_role_is_decode 传递
- thinking/reply 限制在 normal/speculative 两条路径统一生效。
response_max_tokens 端到端支持（APIServer / DataProcessor / Engine）
- OpenAI 协议与采样参数新增 response_max_tokens 字段。
- engine_client 增加参数校验（response_max_tokens > 0）。
- 多个输入处理器统一 max_tokens 上限裁剪逻辑，并在 enable_thinking=False 时应用 response_max_tokens 约束。
- reasoning_max_tokens 校验口径同步（允许 0），并移除 D 侧 reasoning_max_tokens -= 1 的历史修正。
配置与启动参数透传
- 新增 think_truncate_prompt_ids 配置读取、worker 启动参数传递与共享输入注入，支持任意长度注入序列。
测试同步
- 删除与旧 v1/v2 算子强绑定的两份 operator 测试文件。
- 同步更新受影响的输入处理与 E2E 用例断言。

示例（统一算子调用形态）：

limit_thinking_content_length(
    sampled_token_ids,
    share_inputs["max_think_lens"],
    share_inputs["max_reply_lens"],
    share_inputs["step_idx"],
    share_inputs["limit_think_status"],
    share_inputs["stop_flags"],
    share_inputs["eos_token_id"],
    share_inputs["inject_token_ids"],
    think_end_id,
    splitwise_role_is_decode,
)

Usage or Command

新增请求参数（OpenAI 兼容）：

{
  "max_tokens": 256,
  "reasoning_max_tokens": 128,
  "response_max_tokens": 64
}

语义：

reasoning_max_tokens：限制思考阶段 token 数；
response_max_tokens：限制回复阶段 token 数（thinking 关闭时直接约束总回复长度）。

Accuracy Tests

该 PR 涉及 kernel 与后处理路径重构，准确性结果待补充（与 develop 对齐策略）。

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

https://api.github.com/repos/PaddlePaddle/FastDeploy/pulls/6493/files
- Triggering command: /home/REDACTED/work/_temp/ghcca-node/node/bin/node /home/REDACTED/work/_temp/ghcca-node/node/bin/node --enable-source-maps /home/REDACTED/work/_temp/copilot-developer-action-main/dist/index.js (http block)

If you need me to access, download, or install something from one of these locations, you can either:

Configure Actions setup steps to set up my environment, which run before the firewall is enabled
Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

🔒 GitHub Advanced Security automatically protects Copilot coding agent pull requests. You can protect all pull requests by enabling Advanced Security for your repositories. Learn more about Advanced Security.

…addlePaddle#6493) * Initial plan * Migrate PRs PaddlePaddle#6311, PaddlePaddle#6129, PaddlePaddle#6305 to develop and merge unit tests Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com> * fix * update * fix * fix ci * fix ci * Initial plan * test: add test_chat_with_response_max_tokens to test_EB_VL_Lite_serving.py Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com> * test: add disable-thinking case to test_chat_with_response_max_tokens Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com> * test: add both reasoning_max_tokens and response_max_tokens case Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com> * fix ci * fix ci * fix ci --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

paddle-bot · 2026-02-26T02:18:30Z

Thanks for your contribution!

codecov-commenter · 2026-02-26T04:08:02Z

Codecov Report

❌ Patch coverage is 65.04854% with 36 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.5@a368856). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/input/v1/ernie4_5_processor.py	41.66%	4 Missing and 3 partials ⚠️
.../v1/ernie4_5_vl_processor/ernie4_5_vl_processor.py	25.00%	4 Missing and 2 partials ⚠️
fastdeploy/entrypoints/engine_client.py	0.00%	3 Missing and 2 partials ⚠️
fastdeploy/input/v1/text_processor.py	57.14%	3 Missing ⚠️
fastdeploy/input/ernie4_5_processor.py	77.77%	1 Missing and 1 partial ⚠️
...put/ernie4_5_vl_processor/ernie4_5_vl_processor.py	75.00%	1 Missing and 1 partial ⚠️
fastdeploy/input/text_processor.py	75.00%	1 Missing and 1 partial ⚠️
...1/paddleocr_vl_processor/paddleocr_vl_processor.py	50.00%	1 Missing and 1 partial ⚠️
.../input/v1/qwen3_vl_processor/qwen3_vl_processor.py	50.00%	1 Missing and 1 partial ⚠️
...oy/input/v1/qwen_vl_processor/qwen_vl_processor.py	50.00%	1 Missing and 1 partial ⚠️
... and 3 more

Additional details and impacted files

@@              Coverage Diff               @@
##             release/2.5    #6511   +/-   ##
==============================================
  Coverage               ?   68.49%           
==============================================
  Files                  ?      391           
  Lines                  ?    52802           
  Branches               ?     8220           
==============================================
  Hits                   ?    36167           
  Misses                 ?    14004           
  Partials               ?     2631

Flag	Coverage Δ
GPU	`68.49% <65.04%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

yuanlehome had a problem deploying to Metax_ci February 26, 2026 02:18 — with GitHub Actions Error

Delete tests/model_executor/test_thinking_budget.py

5305be4

yuanlehome had a problem deploying to Metax_ci February 26, 2026 02:19 — with GitHub Actions Failure

fix

faba420

yuanlehome had a problem deploying to Metax_ci February 26, 2026 02:23 — with GitHub Actions Failure

freeliuzc approved these changes Feb 26, 2026

View reviewed changes

Jiang-Jia-Jun merged commit 0a5ad26 into PaddlePaddle:release/2.5 Feb 26, 2026
18 of 24 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Cherry-Pick][OP][Feature] 统一 limit_thinking_content_length CUDA 算子，支持回复长度限制与注入序列#6511

[Cherry-Pick][OP][Feature] 统一 limit_thinking_content_length CUDA 算子，支持回复长度限制与注入序列#6511
Jiang-Jia-Jun merged 3 commits into
PaddlePaddle:release/2.5from
yuanlehome:copilot/refactor-and-merge-test-files_2.5

yuanlehome commented Feb 26, 2026 •

edited

Loading

Uh oh!

paddle-bot Bot commented Feb 26, 2026

Uh oh!

codecov-commenter commented Feb 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

yuanlehome commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Modifications

Usage or Command

Accuracy Tests

Checklist

I tried to connect to the following addresses, but was blocked by firewall rules:

Uh oh!

paddle-bot Bot commented Feb 26, 2026

Uh oh!

codecov-commenter commented Feb 26, 2026

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

yuanlehome commented Feb 26, 2026 •

edited

Loading