[WIP] tool-call: experimental migration of all parsers to peg-parser infra (w/ better test coverage) #18353
Conversation
Core PEG parser infrastructure improvements:
- Replace string AST tags with integer tag IDs for type safety and faster dispatch
- Mapper functions defined as lambdas to reduce boilerplate
- Add token_tag and literal_tag helper methods for token-aware parsing
- Improve GBNF grammar generation for optional repetitions and lazy mode
- Make SPACE_RULE optional in JSON schema grammar generation

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
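A rough sketch of what the tag-ID change amounts to (hypothetical names; the real builder lives in `common/chat-peg-parser.h` and also gains `token_tag` / `literal_tag` helpers whose exact signatures aren't shown here):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch only: integer tag IDs replacing string tags on AST nodes.
// With an enum, dispatch is a switch over small integers instead of a chain of
// string comparisons, and a typo in a tag name becomes a compile error.
enum class ast_tag : uint32_t {
    NONE = 0,
    CONTENT,
    REASONING,
    TOOL_NAME,
    TOOL_ARGS,
};

struct ast_node {
    ast_tag               tag;      // was: std::string tag
    std::string           text;     // matched source text
    std::vector<ast_node> children;
};

void visit(const ast_node & node) {
    switch (node.tag) {
        case ast_tag::REASONING: /* stream as reasoning_content */ break;
        case ast_tag::TOOL_NAME: /* must be emitted atomically  */ break;
        default:                 /* plain content                */ break;
    }
}
```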
Migrate 28 chat template formats to the unified PEG parser approach:
- Each format gets its own module in common/chat-parsers/
- Shared internal header for common patterns and utilities
- XML toolcall infrastructure for formats using XML-style tool calls
- Updated routing in chat.cpp and chat-parser.cpp

Formats migrated: Apertus, Apriel 1.5, Command R7B, DeepSeek R1/V3.1, Firefunction V2, Functionary V3.1/V3.2, Generic, GLM 4.5, GPT-OSS, Granite, Hermes 2 Pro, Kimi K2, LFM2, Llama 3.x, Magistral, MiniMax M2, Ministral 3, Mistral Nemo, Nemotron V2/V3, Qwen3 Coder XML, Seed OSS, Xiaomi MIMO.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add systematic streaming test coverage for all 28 parser formats:
- Needle streaming tests inject unique markers into each semantic field
- Tests verify incremental parsing at every character boundary
- Cross-check declared capabilities against minja's runtime detection
- 21 formats x 6+ scenarios = 126+ regression tests

The needle technique catches:
- Content loss or truncation (both needles must be present)
- Out-of-order emission and buffering bugs (N1 before N2 per pair)
- Function name splitting during streaming (atomic tool names)
- Tool argument regression (args never get shorter)
- Key ordering violations (key N finishes before key N+1)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add feature flag infrastructure for safe migration:
- --experimental-new-parsers CLI flag (defaults to off)
- LLAMA_USE_NEW_PARSERS env var for testing
- Dual-path testing: legacy parsers active by default
- Server integration for runtime parser selection

Also includes:
- Nemotron Nano template fix
- Granite template rename for consistency
- Test model fetching updates

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When tool_choice is "required", the model MUST output tool calls without any preceding content. This change ensures that the following parsers enforce this constraint:
- apertus.cpp
- glm-4-5.cpp
- lfm2.cpp
- minimax-m2.cpp
- nemotron-v2.cpp
- nemotron-v3.cpp
- qwen3-coder-xml.cpp

Previously, these parsers would accept content before tool calls even in required mode, allowing the model to bypass the tool call requirement. Now, when require_tools is true, the parser only matches tool calls (with optional reasoning blocks where supported).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
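A toy illustration of the enforced shape (Hermes-style tags for readability; each listed parser uses its own delimiters, and the real enforcement lives in the PEG rules rather than a post-hoc check):

```cpp
#include <cassert>
#include <string>

// Toy check (not the real parser): in required mode the only accepted shapes are
// an optional reasoning block followed by tool calls - any leading content fails.
static bool accepts_in_required_mode(const std::string & out) {
    size_t pos = 0;
    if (out.rfind("<think>", 0) == 0) {                  // optional reasoning block
        pos = out.find("</think>");
        if (pos == std::string::npos) return false;
        pos += 8;                                        // skip past "</think>"
    }
    return out.compare(pos, 11, "<tool_call>") == 0;     // must go straight to a tool call
}

int main() {
    assert( accepts_in_required_mode("<tool_call>{\"name\":\"f\"}</tool_call>"));
    assert( accepts_in_required_mode("<think>plan</think><tool_call>{}</tool_call>"));
    assert(!accepts_in_required_mode("Sure, let me call it: <tool_call>{}</tool_call>"));
}
```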
Holy...
I'll have to look more closely, but nothing in particular stands out as disagreeable.
Is this in reference to the raw string parameters that circumvent the grammar constraints? I intend to add a schema optimization step that should help a bit.

Unfortunately, I'm starting to think we might have to loosen the grammar constraints and rely more on the agent feedback loop if model creators no longer want to emit JSON.
@ochafik actually I've been working on an alternative approach - an automatic PEG parser generator based on differential template output analysis that will handle ~90% of all templates out of the box - then we only have to handle the really hard/weird/atypical ones.
Consolidates the common pattern in XML-based chat parsers for handling schema-based parameter values:
- For string schemas: use until[_max](delimiter, maxLength?)
- For non-string schemas: use schema() with optional space wrapping

This reduces boilerplate across glm-4-5, minimax-m2, qwen3-coder-xml, and seed-oss parsers (removes ~99 lines, adds ~94 lines with docs).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
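A sketch of the consolidated dispatch, under the assumption that the helper names are illustrative (the returned strings merely *describe* the rule that would be built; the real helper returns parser rules):

```cpp
#include <string>
#include <nlohmann/json.hpp> // already vendored in llama.cpp

using json = nlohmann::json;

static bool is_string_schema(const json & schema) {
    return schema.is_object() && schema.value("type", "") == "string";
}

static std::string describe_value_rule(const json & schema, const std::string & end_tag) {
    if (is_string_schema(schema)) {
        // string params: raw text until the closing delimiter, capped by maxLength if set
        if (schema.contains("maxLength")) {
            return "until_max(\"" + end_tag + "\", " + schema["maxLength"].dump() + ")";
        }
        return "until(\"" + end_tag + "\")";
    }
    // everything else: constrained by the JSON schema, optionally space-wrapped
    return "space? schema(...) space?";
}
```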
These templates' parsers don't allow content in tool_choice=required mode, but the test was generating content because tool_required_allows_content defaulted to true.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@aldehir Sorry no, I meant for the handling of raw string arguments.

Without a clean way to escape XML arg end blocks in string arguments (and corresponding fine-tuning), all the XML formats seem inherently fragile, unless we accept they contain a JSON string (which I'd done in my first Qwen3 parser). XML-esque escaping of the closing delimiter would be one option. Alternatively we could introduce token-based parsing (e.g. requiring the delimiter to be a special token).

Re/ string args, I've added a `schema_or_raw_string_until` to do the same treatment across the afflicted parsers.
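To make the fragility concrete, here is the kind of ambiguity at stake (GLM-style tags, purely for illustration): nothing distinguishes a closing delimiter occurring inside a raw string value from the real end of the argument:

```
<arg_key>note</arg_key>
<arg_value>this value legitimately contains </arg_value> ... so where does it end?</arg_value>
```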
@pwilkin this sounds super exciting, I wanted to do this in Minja at some point but never finished my prototype / got desperate as I realized how diverse the practices were in Jinja templates 😓

But more importantly... lots of small models output tool calls with alarmingly high creativity, not at all guessable by just analyzing templates (Qwen 2.5 comes to mind: I had to sample many tool calls to see what it comes up with in practice, hence the horribly long and accommodating list of triggers in the Hermes 2 grammar & parser). Not to mention outright broken templates. (Could you integrate alternative syntaxes by having multiple templates feed into the same diff engine?)

Anyway, would love to see what you come up with! (Oh and btw, feel free to host this in https://github.com/ochafik/minja if it needs too much testing infra for llama.cpp: it already has all the templates, could be a good fit / playground)
Oh, interesting. I thought this would be handled by delegating to the schema handling.

I came to the same conclusions with the XML formats. The vLLM tool call parsing seems affected by the same problems. I'm not sure there's much more we can do.
I worry this will impact model performance. I've noticed that with Devstral (not an XML format), heavy constraining is detrimental to its performance. I'd rather accept that it will occasionally fail.

> // GBNF doesn't have token-awareness, so same as text-based `until`

It does now (#17816). I was slowly working my way to making the parser token-aware, but for the same reason (the argument delimiter is not a unique token) it became a low priority. I mainly added it to handle the grammar-enforced reasoning budget.
- Remove unused skip_content_before/skip_content_after in nemotron-v3.cpp
- Fix trailing whitespace in test-chat.cpp

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
(force-pushed from 2c68c82 to e8aa015)
- Update SPACE_RULE from `| " " | "\n"...` to `( " " | "\n"... )?` in Python and JavaScript implementations to match C++ changes
- Update test-json-schema-to-grammar.cpp expected outputs
- Update test-grammar-parser.cpp expectations for empty group parsing

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
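In GBNF terms (whitespace body elided just as in the commit message), the shape of the change is:

```
# before: the empty match expressed as a leading bare alternative
space ::= | " " | "\n" ...

# after: an explicitly optional group
space ::= ( " " | "\n" ... )?
```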
(force-pushed from e8aa015 to 427afe6)
The template file had Windows-style CRLF line endings, which were causing test failures on Windows CI when parsing reasoning content.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There's a desire to refactor the common chat API (ref #18215). I was working on this, but given this PR and @pwilkin's effort, I'll wait to see what direction we go. Otherwise merging will be a challenge.

This PR accomplishes part of what I had in mind: splitting the parsers into separate files. I'd also like to maintain a stateful object during an entire generation request. That would let us remove the serialization mechanics in the PEG parser and implement a variant of incremental Packrat parsing (https://ohmjs.org/pubs/sle2017/incremental-packrat-parsing.pdf).
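For reference, a minimal sketch of the memoization idea behind incremental Packrat parsing (illustrative types, not an llama.cpp API):

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>

// Gist: memoize (rule, position) -> result and, when input grows, keep every
// memo entry whose examined range never touched the old end of input. With
// append-only streaming, those entries cannot be affected by the new bytes.
struct memo_key {
    int    rule_id;
    size_t pos;
    bool operator==(const memo_key & o) const { return rule_id == o.rule_id && pos == o.pos; }
};
struct memo_key_hash {
    size_t operator()(const memo_key & k) const {
        return std::hash<size_t>()(k.pos) * 31u + std::hash<int>()(k.rule_id);
    }
};
struct memo_entry {
    bool   matched;
    size_t examined_end; // valid while examined_end < previously-seen input length
};
using memo_table = std::unordered_map<memo_key, memo_entry, memo_key_hash>;
```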
@aldehir We could experiment w/ a bit of jinja hackery: ideally we'd want models to think of escaping the closing delimiters themselves.

Longer term we should probably start a page about best practices for template writers / finetuning folks (how to pick your tool call syntax and make it work reliably by using the right special tokens, how to escape things properly, etc.).
Sounds fab; just note that in the OG tool-call support PR it was pointed out that sharing objects between the two sides could be a bit iffy w.r.t. threading.
Parser fixes:
- Fix 6 parsers to reject content in tool_choice=required mode (llama-3-x, functionary-v3-1/v3-2, hermes-2-pro, gpt-oss, mistral-nemo)
- Required mode now only allows tool calls (and optionally thinking)

Test improvements:
- Add test_required_tool_rejects_content using init_delta for proper template rendering instead of manual tag construction
- Create shared get_template_capabilities() function to avoid duplication
- Convert template_capabilities bools to named scoped enums for type safety: ThinkingSupport, ToolSupport, Skip, ReasoningRequiresTools, ToolsEmitContentWithCalls, InjectReasoningAfterFormat, SupportsDisableThinking, SupportsReasoningOnly, ToolCallsHaveIds
- Remove tool_required_allows_content field (was documenting buggy behavior)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds test_format_detection_with_tools() to verify that when experimental_new_parsers is enabled with tools, templates correctly detect their expected format (not CONTENT_ONLY) and generate a non-empty grammar and parser.

This catches "Pattern 1" failures where templates are incorrectly detected as content-only when they should have a proper tool-calling format.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
README.md was accidentally included in the list of jinja templates to test, causing test failures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- removed old grammars that had slopped into new peg parsers
- fence triggers by inputs.tool_choice != COMMON_CHAT_TOOL_CHOICE_REQUIRED checks
- derive laziness from presence of triggers
- foreach_function gives resolved parameters to everyone!
TL;DR: it's a lot, but there's a lot more testing than before.
Building on the PEG parser infrastructure introduced in #17136 by @aldehir, this is an experimental effort to migrate all chat template formats to the unified PEG approach.
Why migrate?

The current monolithic `common/chat.cpp` has grown to ~25 ad-hoc parser implementations that are difficult to maintain. Lots of parsing bugs are hard to reproduce and diagnose (esp. if the user wasn't in `--verbose` mode). The PEG infrastructure offers a cleaner path forward, with strong guarantees (modulo bugs) that whatever is allowed to be generated should be parseable.
How to Test
Changes:
- `common/chat-parsers/*.cpp` - 28 modular parser implementations
- `--experimental-new-parsers` - defaults to off, nothing changes by default

New "Needle" Streaming Tests
Existing streaming tests (`tools/server/tests/unit/test_tool_call.py`) require loading real models and cover only a subset of formats. This PR adds systematic coverage for all 21 formats without the model-loading overhead.

This migration was designed to be safe through systematic test constraints:
21 formats x 6+ scenarios = up to 126 regression tests (some scenarios filtered based on format capabilities)
Each format is tested across the scenarios its declared capabilities support.
How Needle Tests Work
The "needle" technique injects unique marker pairs into each semantic field. For example, in Hermes 2 Pro format with thinking and a tool call:
The test parses this message at every character boundary (simulating streaming), and verifies:
- both needles of every pair are present (no content loss or truncation)
- N1 is emitted before N2 in each pair (no out-of-order emission or buffering bugs)
- tool names arrive atomically (no function name splitting mid-stream)
- tool arguments never get shorter (no argument regression)
- key N finishes before key N+1 (no key ordering violations)
This aims to prove parsers are truly incremental: partial input produces partial output, fields stream in proper order, and nothing is buffered unnecessarily.
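A condensed C++ version of the core invariant check (the real tests in tests/test-chat.cpp also track per-field needle ordering and tool-name atomicity):

```cpp
#include <cassert>
#include <functional>
#include <string>

// Toy version of the invariant the needle tests enforce: parse every prefix of
// the message (simulating one-character streaming) and require that extracted
// output only ever grows, with earlier output always a prefix of later output.
void check_monotonic(const std::string & msg,
                     const std::function<std::string(const std::string &)> & extract) {
    std::string prev;
    for (size_t n = 0; n <= msg.size(); n++) {
        const std::string cur = extract(msg.substr(0, n));
        assert(cur.size() >= prev.size());              // args/content never get shorter
        assert(cur.compare(0, prev.size(), prev) == 0); // no rewrites of emitted output
        prev = cur;
    }
}

int main() {
    // dummy extractor: pass content through untouched (a real one strips markup)
    check_monotonic("hello N1 world N2", [](const std::string & s) { return s; });
}
```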
Known Limitations
The PEG implementation has gaps vs legacy (TBC):
- `allOf`/`anyOf`/`$ref` patterns not fully handled
- `until_max` has a weird implementation (maybe we just drop maxLength on XML formats)

Proposed Migration Plan
- Validate behind `--experimental-new-parsers`
- Once stable, remove the legacy code:
  - `common/chat-parser.cpp`: ~28 legacy parser functions (~900 lines)
  - `common/chat.cpp`: ~19 legacy init functions (~600 lines)
  - `common/chat-peg-parser.cpp/.h`: class-based builders/mappers (~220 lines)
  - `common/chat-parser-xml-toolcall.cpp/.h`: XML grammar builder (~900 lines) - new PEG parsers generate grammars directly from their parser definitions

Follow up work
- `supports_tool_call_id` - whether tool calls include IDs
- `reasoning_requires_tools` - whether thinking mode only works with tools
- `tools_emit_content_with_calls` - whether tool calls can include content
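A hypothetical shape for these flags, should they graduate from the test suite into the format descriptions themselves:

```cpp
// Hypothetical sketch of the capability flags named above (field names taken
// from the list; whether they end up as bools or scoped enums is open):
struct common_chat_template_caps {
    bool supports_tool_call_id         = false; // tool calls include IDs
    bool reasoning_requires_tools      = false; // thinking mode only works with tools
    bool tools_emit_content_with_calls = false; // tool calls can include content
};
```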