[rollout] feat: Add LangGraph chat scheduler for agent training with tool use support #1

waleko · 2025-06-06T16:05:21Z

No description provided.

# Conflicts: # setup.py # verl/workers/rollout/vllm_rollout/vllm_async_server.py

…heduler.py with simple two-step LLM graph - Test generate_sequences method with revision workflow (LLM -> revision prompt -> LLM) - Verify message counts, shapes, and response structure - No tools or complex agents, just basic LLM interaction flow

# Conflicts: # .github/workflows/vllm.yml # verl/workers/rollout/async_server.py

# Conflicts: # verl/trainer/ppo/ray_trainer.py

…engine#2365) ### What does this PR do? Fix a regression from volcengine#1911, because the PR did not change the sglang async branch. CI did not catch this error because it only run 1 step, but this error happen in the second test. So I update the testcases to run 2 steps. To reproduce the bug, run test: TOTAL_TRAIN_STEPS=2 ENGINE=sglang ROLLOUT_MODE=async bash tests/special_e2e/ppo_trainer/run_function_reward.sh It fail with: ``` (WorkerDict pid=1257286) Total steps: 2, num_warmup_steps: 0 (WorkerDict pid=1257286) Actor use_remove_padding=True (WorkerDict pid=1257286) Actor use_fused_kernels=False (AsyncSglangServer pid=1260392) FastAPI listen on [192.168.111.48:40451](http://192.168.111.48:40451/) (WorkerDict pid=1257286) terminate called after throwing an instance of 'c10::Error' (WorkerDict pid=1257286) what(): CUDA error: an illegal memory access was encountered (WorkerDict pid=1257286) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. (WorkerDict pid=1257286) For debugging consider passing CUDA_LAUNCH_BLOCKING=1 (WorkerDict pid=1257286) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. (WorkerDict pid=1257286) (WorkerDict pid=1257286) Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first): (WorkerDict pid=1257286) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fbf6036c1b6 in /usr/local/lib/python3.10/dist-packages/torch/lib/[libc10.so](http://libc10.so/)) (WorkerDict pid=1257286) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fbf60315a76 in /usr/local/lib/python3.10/dist-packages/torch/lib/[libc10.so](http://libc10.so/)) (WorkerDict pid=1257286) frame volcengine#2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fbf6080d918 in ``` ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: https://github.com/volcengine/verl/issues?q=is%3Aissue%20state%3Aopen%20an%20illegal%20memory%20access%20was%20encountered - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test ``` (TaskRunner pid=1647269) step:2 - global_seqlen/min:13075 - global_seqlen/max:14837 - global_seqlen/minmax_diff:1762 - global_seqlen/balanced_min:14231 - global_seqlen/balanced_max:14232 - global_seqlen/mean:14231.5 - actor/entropy:2.0606913566589355 - critic/vf_loss:8.7157882153 ``` ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### High-Level Design > Demonstrate the high-level design if this PR is complex. ### Specific Changes > List the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide). - [ X] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

waleko added 5 commits June 3, 2025 12:58

langgraph poc

81134f6

working state

dfcb735

Merge remote-tracking branch 'origin/main' into dev

fb53a60

move to another dir; add graph config

9ad4236

fix script

3da6270

waleko changed the title ~~[rollout] feat: Add LangGraph chat scheduler for agent training with tool use support~~ [example] feat: Add LangGraph chat scheduler for agent training with tool use support Jun 6, 2025

waleko added 2 commits June 10, 2025 13:15

Merge remote-tracking branch 'refs/remotes/origin/main' into dev

6c57f05

# Conflicts: # setup.py # verl/workers/rollout/vllm_rollout/vllm_async_server.py

add license

fe40704

waleko changed the title ~~[example] feat: Add LangGraph chat scheduler for agent training with tool use support~~ [rollout] feat: Add LangGraph chat scheduler for agent training with tool use support Jun 10, 2025

waleko changed the title ~~[rollout] feat: Add LangGraph chat scheduler for agent training with tool use support~~ ~[rollout] feat: Add LangGraph chat scheduler for agent training with tool use support~ Jun 10, 2025

waleko closed this Jun 10, 2025

waleko changed the title ~~~[rollout] feat: Add LangGraph chat scheduler for agent training with tool use support~~~ [rollout] feat: Add LangGraph chat scheduler for agent training with tool use support Jun 10, 2025

waleko reopened this Jun 10, 2025

waleko and others added 10 commits June 10, 2025 19:49

Add langgraph chat scheduler tests; revert chat scheduler breaking ch…

3119b7a

…anges (volcengine#1831)

Merge remote-tracking branch 'origin/main' into dev

9d2f85c

remove ray_stop_all

0cf3242

Fix langgraph chat scheduler after merge

c053b11

Merge remote-tracking branch 'origin/main' into dev

063dba7

# Conflicts: # .github/workflows/vllm.yml # verl/workers/rollout/async_server.py

Merge remote-tracking branch 'origin/main' into dev

24bb6ed

# Conflicts: # verl/trainer/ppo/ray_trainer.py

move test to vllm folder

bd07f49

pre-commit fixes

f36f88b

fix tests

7813c0c

waleko closed this Jul 15, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[rollout] feat: Add LangGraph chat scheduler for agent training with tool use support #1

[rollout] feat: Add LangGraph chat scheduler for agent training with tool use support #1

Uh oh!

waleko commented Jun 6, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[rollout] feat: Add LangGraph chat scheduler for agent training with tool use support #1

[rollout] feat: Add LangGraph chat scheduler for agent training with tool use support #1

Uh oh!

Conversation

waleko commented Jun 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

waleko commented Jun 6, 2025 •

edited

Loading