Skip to content

[Cherry-Pick][OP][Feature] 统一 limit_thinking_content_length CUDA 算子,支持回复长度限制与注入序列#6511

Merged
Jiang-Jia-Jun merged 3 commits into
PaddlePaddle:release/2.5from
yuanlehome:copilot/refactor-and-merge-test-files_2.5
Feb 26, 2026
Merged

[Cherry-Pick][OP][Feature] 统一 limit_thinking_content_length CUDA 算子,支持回复长度限制与注入序列#6511
Jiang-Jia-Jun merged 3 commits into
PaddlePaddle:release/2.5from
yuanlehome:copilot/refactor-and-merge-test-files_2.5

Conversation

@yuanlehome
Copy link
Copy Markdown
Collaborator

@yuanlehome yuanlehome commented Feb 26, 2026

Motivation

将 develop 上 PR #6493 的核心能力迁移到 release/2.5:统一 thinking 截断 CUDA 算子,并引入 response_max_tokens 端到端支持。该迁移同时对 release 分支现有目录与调用差异做了对齐,保持行为与上游改动一致。

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

💡 如若此PR是Cherry Pick,PR标题需遵循格式,在最开始加上[Cherry-Pick]标签,以及最后面加上原PR ID,例如[Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

  • 算子统一(OP / SpecDecode)

    • 合并 limit_thinking_content_length_v1/v2 为统一 GPU 算子 limit_thinking_content_length
    • 合并 speculate_limit_thinking_content_length_v1/v2 为统一推测解码算子 speculate_limit_thinking_content_length
    • 更新 cpp_extensions.cc 导出符号与 setup_ops.py 编译源,移除旧 v1/v2 文件。
  • 推理后处理链路改造(Executor / Worker)

    • pre_and_post_process.py 移除按 limit_strategy 分发的 Python 包装逻辑,直接调用统一算子。
    • gpu_model_runner.py 新增并维护:
      • max_reply_lens
      • inject_token_ids
      • splitwise_role_is_decode 传递
    • thinking/reply 限制在 normal/speculative 两条路径统一生效。
  • response_max_tokens 端到端支持(APIServer / DataProcessor / Engine)

    • OpenAI 协议与采样参数新增 response_max_tokens 字段。
    • engine_client 增加参数校验(response_max_tokens > 0)。
    • 多个输入处理器统一 max_tokens 上限裁剪逻辑,并在 enable_thinking=False 时应用 response_max_tokens 约束。
    • reasoning_max_tokens 校验口径同步(允许 0),并移除 D 侧 reasoning_max_tokens -= 1 的历史修正。
  • 配置与启动参数透传

    • 新增 think_truncate_prompt_ids 配置读取、worker 启动参数传递与共享输入注入,支持任意长度注入序列。
  • 测试同步

    • 删除与旧 v1/v2 算子强绑定的两份 operator 测试文件。
    • 同步更新受影响的输入处理与 E2E 用例断言。

示例(统一算子调用形态):

limit_thinking_content_length(
    sampled_token_ids,
    share_inputs["max_think_lens"],
    share_inputs["max_reply_lens"],
    share_inputs["step_idx"],
    share_inputs["limit_think_status"],
    share_inputs["stop_flags"],
    share_inputs["eos_token_id"],
    share_inputs["inject_token_ids"],
    think_end_id,
    splitwise_role_is_decode,
)

Usage or Command

新增请求参数(OpenAI 兼容):

{
  "max_tokens": 256,
  "reasoning_max_tokens": 128,
  "response_max_tokens": 64
}

语义:

  • reasoning_max_tokens:限制思考阶段 token 数;
  • response_max_tokens:限制回复阶段 token 数(thinking 关闭时直接约束总回复长度)。

Accuracy Tests

该 PR 涉及 kernel 与后处理路径重构,准确性结果待补充(与 develop 对齐策略)。

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • https://api.github.com/repos/PaddlePaddle/FastDeploy/pulls/6493/files
    • Triggering command: /home/REDACTED/work/_temp/ghcca-node/node/bin/node /home/REDACTED/work/_temp/ghcca-node/node/bin/node --enable-source-maps /home/REDACTED/work/_temp/copilot-developer-action-main/dist/index.js (http block)

If you need me to access, download, or install something from one of these locations, you can either:


🔒 GitHub Advanced Security automatically protects Copilot coding agent pull requests. You can protect all pull requests by enabling Advanced Security for your repositories. Learn more about Advanced Security.

…addlePaddle#6493)

* Initial plan

* Migrate PRs PaddlePaddle#6311, PaddlePaddle#6129, PaddlePaddle#6305 to develop and merge unit tests

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* fix

* update

* fix

* fix ci

* fix ci

* Initial plan

* test: add test_chat_with_response_max_tokens to test_EB_VL_Lite_serving.py

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* test: add disable-thinking case to test_chat_with_response_max_tokens

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* test: add both reasoning_max_tokens and response_max_tokens case

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* fix ci

* fix ci

* fix ci

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>
@paddle-bot
Copy link
Copy Markdown

paddle-bot Bot commented Feb 26, 2026

Thanks for your contribution!

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 65.04854% with 36 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.5@a368856). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/input/v1/ernie4_5_processor.py 41.66% 4 Missing and 3 partials ⚠️
.../v1/ernie4_5_vl_processor/ernie4_5_vl_processor.py 25.00% 4 Missing and 2 partials ⚠️
fastdeploy/entrypoints/engine_client.py 0.00% 3 Missing and 2 partials ⚠️
fastdeploy/input/v1/text_processor.py 57.14% 3 Missing ⚠️
fastdeploy/input/ernie4_5_processor.py 77.77% 1 Missing and 1 partial ⚠️
...put/ernie4_5_vl_processor/ernie4_5_vl_processor.py 75.00% 1 Missing and 1 partial ⚠️
fastdeploy/input/text_processor.py 75.00% 1 Missing and 1 partial ⚠️
...1/paddleocr_vl_processor/paddleocr_vl_processor.py 50.00% 1 Missing and 1 partial ⚠️
.../input/v1/qwen3_vl_processor/qwen3_vl_processor.py 50.00% 1 Missing and 1 partial ⚠️
...oy/input/v1/qwen_vl_processor/qwen_vl_processor.py 50.00% 1 Missing and 1 partial ⚠️
... and 3 more
Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.5    #6511   +/-   ##
==============================================
  Coverage               ?   68.49%           
==============================================
  Files                  ?      391           
  Lines                  ?    52802           
  Branches               ?     8220           
==============================================
  Hits                   ?    36167           
  Misses                 ?    14004           
  Partials               ?     2631           
Flag Coverage Δ
GPU 68.49% <65.04%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit 0a5ad26 into PaddlePaddle:release/2.5 Feb 26, 2026
18 of 24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants