[Cherry-Pick][OP][Feature] Unify the limit_thinking_content_length CUDA operator, with support for response length limits and injected sequences #6511
Merged
Jiang-Jia-Jun merged 3 commits into PaddlePaddle:release/2.5 on Feb 26, 2026
…addlePaddle#6493)
* Initial plan
* Migrate PRs PaddlePaddle#6311, PaddlePaddle#6129, PaddlePaddle#6305 to develop and merge unit tests
  Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>
* fix
* update
* fix
* fix ci
* fix ci
* Initial plan
* test: add test_chat_with_response_max_tokens to test_EB_VL_Lite_serving.py
  Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>
* test: add disable-thinking case to test_chat_with_response_max_tokens
  Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>
* test: add both reasoning_max_tokens and response_max_tokens case
  Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>
* fix ci
* fix ci
* fix ci
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## release/2.5 #6511 +/- ##
==============================================
Coverage ? 68.49%
==============================================
Files ? 391
Lines ? 52802
Branches ? 8220
==============================================
Hits ? 36167
Misses ? 14004
Partials ? 2631
freeliuzc approved these changes on Feb 26, 2026
0a5ad26 into PaddlePaddle:release/2.5
18 of 24 checks passed
Motivation
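This PR ports the core functionality of PR #6493 on develop to release/2.5: it unifies the thinking-truncation CUDA operators and introduces end-to-end support for response_max_tokens. The port also aligns with the directory layout and call-site differences that exist on the release branch, so that behavior stays consistent with the upstream change.
Modifications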
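Operator unification (OP / SpecDecode)
- limit_thinking_content_length_v1/v2 are unified into a single GPU operator, limit_thinking_content_length.
- speculate_limit_thinking_content_length_v1/v2 are unified into a single speculative-decoding operator, speculate_limit_thinking_content_length.
- cpp_extensions.cc exported symbols and setup_ops.py compile sources, with the old v1/v2 files removed.
Inference post-processing pipeline changes (Executor / Worker)
- pre_and_post_process.py drops the Python wrapper logic that dispatched on limit_strategy and calls the unified operators directly.
- gpu_model_runner.py adds and maintains:
  - max_reply_lens
  - inject_token_ids
  - passing of splitwise_role_is_decode
End-to-end response_max_tokens support (APIServer / DataProcessor / Engine)
- A response_max_tokens field.
- engine_client adds parameter validation (response_max_tokens > 0).
- max_tokens upper-bound clipping logic, with the response_max_tokens constraint applied when enable_thinking=False.
- reasoning_max_tokens validation is brought in line (0 is allowed), and the historical reasoning_max_tokens -= 1 correction on the D side is removed.
Configuration and launch-argument passthrough
- think_truncate_prompt_ids: configuration reading, worker launch-argument passing, and shared-input injection, supporting injected sequences of arbitrary length.
Test synchronization
Example (call form of the unified operator):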
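To illustrate the per-token behaviour that the items above describe, here is a minimal pure-Python sketch. It is not the CUDA kernel and does not show the operator's real signature: RequestState, limit_step, generate, and all parameter names are hypothetical. It also assumes that the injected sequence is non-empty and ends the thinking phase, that a negative budget means "unlimited", and that the reply budget is enforced by forcing an EOS token.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class RequestState:
    """Hypothetical per-request bookkeeping (not the real kernel state)."""
    think_len: int = 0      # tokens emitted while thinking
    reply_len: int = 0      # tokens emitted while replying
    inject_pos: int = -1    # index into inject_token_ids; -1 = not injecting
    thinking: bool = True   # False once the thinking phase has ended
    finished: bool = False  # True once EOS was emitted or forced


def limit_step(state: RequestState, sampled_token: int, *,
               max_think_len: int, max_reply_len: int,
               inject_token_ids: List[int], think_end_id: int,
               eos_token_id: int) -> int:
    """Return the token to emit for this step, overriding the sample if needed."""
    if state.finished:
        return eos_token_id

    if state.thinking:
        if state.inject_pos < 0:
            # The model closed the thinking phase on its own.
            if sampled_token == think_end_id:
                state.thinking = False
                return sampled_token
            # Still within the thinking budget (negative = unlimited, assumed).
            if max_think_len < 0 or state.think_len < max_think_len:
                state.think_len += 1
                return sampled_token
            # Budget exhausted: start replaying the injected sequence.
            state.inject_pos = 0
        # Emit the injected sequence one token per step, whatever its length.
        token = inject_token_ids[state.inject_pos]
        state.inject_pos += 1
        if state.inject_pos == len(inject_token_ids):
            state.inject_pos = -1
            state.thinking = False
        return token

    # Reply phase: force EOS once the reply budget is used up.
    if max_reply_len >= 0 and state.reply_len >= max_reply_len:
        state.finished = True
        return eos_token_id
    state.reply_len += 1
    if sampled_token == eos_token_id:
        state.finished = True
    return sampled_token


def generate(sampled_stream: Iterable[int], **limits) -> List[int]:
    """Drive limit_step over a stream of sampled tokens for a single request."""
    state = RequestState()
    out: List[int] = []
    for tok in sampled_stream:
        out.append(limit_step(state, tok, **limits))
        if state.finished:
            break
    return out
```

With a thinking budget of 2, a reply budget of 2, and the injected sequence [90, 91, 99] (99 standing in for the think-end token), calling generate([5] * 8, max_think_len=2, max_reply_len=2, inject_token_ids=[90, 91, 99], think_end_id=99, eos_token_id=0) yields [5, 5, 90, 91, 99, 5, 5, 0]: two thinking tokens, the injected sequence, two reply tokens, then a forced EOS.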
Usage or Command
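New request parameters (OpenAI-compatible):
{ "max_tokens": 256, "reasoning_max_tokens": 128, "response_max_tokens": 64 }
Semantics:
- reasoning_max_tokens: limits the number of tokens in the thinking phase;
- response_max_tokens: limits the number of tokens in the reply phase (when thinking is disabled, it directly constrains the total reply length).
Accuracy Tests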
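This PR refactors the kernel and the post-processing path; accuracy results are still to be added (following the same alignment strategy as develop).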
Checklist
- [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- … pre-commit before commit.
- … release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.
Warning
Firewall rules blocked me from connecting to one or more addresses
I tried to connect to the following addresses, but was blocked by firewall rules:
https://api.github.com/repos/PaddlePaddle/FastDeploy/pulls/6493/files
/home/REDACTED/work/_temp/ghcca-node/node/bin/node /home/REDACTED/work/_temp/ghcca-node/node/bin/node --enable-source-maps /home/REDACTED/work/_temp/copilot-developer-action-main/dist/index.js (http block)
If you need me to access, download, or install something from one of these locations, you can either: