Implement structured engine for parsing json grammar by token with response_format: {type: json_object} #3328

Open · wants to merge 1 commit into main

Conversation

@pathorn commented Mar 12, 2024

In contrast to the recently added "guided json" feature, this introduces an "unguided" JSON mode that runs as a logits_processor with constant-time per-token overhead (see my post about JSON Mode on the DeepInfra blog).
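
For illustration, here is a minimal sketch of what such a logits processor can look like (the class and attribute names are hypothetical, not the actual code in this PR): per generated token, the work is one table lookup to advance the parser state plus one masked_fill over the logits.

```python
import torch

class JSONModeLogitsProcessor:
    """Sketch of constant-time-per-token JSON enforcement.

    Assumes `transitions` and `state_masks` were precomputed at startup:
      transitions[state][token_id] -> next parser state
      state_masks[state]           -> bool tensor over the vocabulary
    """

    def __init__(self, transitions, state_masks, initial_state=0):
        self.transitions = transitions
        self.state_masks = state_masks
        self.state = initial_state

    def __call__(self, token_ids, logits):
        if token_ids:
            # Advance the parser by the last sampled token (a dict lookup).
            self.state = self.transitions[self.state][token_ids[-1]]
        # Mask out every token that would break the JSON parse.
        return logits.masked_fill(~self.state_masks[self.state], float("-inf"))
```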

The feature is exposed in the OpenAI chat and completions APIs as "response_format": {"type": "json_object"}, matching the OpenAI spec.
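
For example, against a vLLM OpenAI-compatible server started with --enable-json-mode, a request could look like this (the model name, URL, and API key below are placeholders):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # placeholder model name
    messages=[{"role": "user", "content": "List three primes as a JSON object."}],
    response_format={"type": "json_object"},
)
print(resp.choices[0].message.content)  # output is constrained to valid JSON
```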

A major caveat of the current implementation is that it performs a lot of slow preprocessing in Python for each token in the vocabulary. On a model such as Llama, with 32k tokens in its tokenizer, this adds about 15 seconds to startup time; on Gemma, with its much larger vocabulary (256k tokens), it adds several minutes. Because of the slow startup, this feature is disabled by default and must be requested with --enable-json-mode on the command line.
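
Roughly, the startup cost has this shape: one grammar check per (parser state, vocabulary token) pair, all in pure Python. This is a hypothetical sketch; `state.accepts` stands in for the engine's actual transition logic.

```python
import torch

def precompute_state_masks(tokenizer, states):
    """Build one boolean mask over the vocabulary per parser state, marking
    the tokens whose decoded text keeps the JSON parse valid.  This
    O(|states| * |vocab|) Python loop is the startup cost described above."""
    vocab_size = len(tokenizer)
    masks = {}
    for state in states:
        allowed = torch.zeros(vocab_size, dtype=torch.bool)
        for token_id in range(vocab_size):
            piece = tokenizer.decode([token_id])
            allowed[token_id] = state.accepts(piece)  # hypothetical check
        masks[state.id] = allowed
    return masks
```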

Despite the slow startup time, I chose this design because it guarantees that runtime execution has roughly no overhead (well under 1% in my testing).

We are currently running this in production on all of our models.

I have added comments in the code, but much of it is still very complex and nuanced. If anyone would like me to walk through the code, I'd be happy to add more explanatory comments, or perhaps hop on a Discord call and explain parts of it.

Future work:

  • The "StructureExecutionEngine" is designed in such a way that adding schema validation in real-time may be possible with little or no runtime performance overhead, but will require work. This is not so important due to the existence of guided_json, but it could offer an alternative with different performance characteristics.
  • If there is interest, the startup overhead could be significantly reduced by porting this code to a more efficient language: It would be a very good fit for rust, but C++ would also work. I believe this to be one of the problems for which python is 100 times slower than C, so I expect that a rewrite in C++ would reduce startup overhead to a fraction of a second for most models (even multi-threadable) and allow this feature to be enabled by default.
  • If we did rewrite this engine in C++, it might be fast enough to parse user supplied grammars, which could open up other possible applications.

@simon-mo (Collaborator)

Thank you for the PR. I have another, simpler implementation based on a context-free grammar in #3211. Can you elaborate on the differences?

github-actions bot commented Oct 29, 2024

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale label Oct 29, 2024
@mergify mergify bot added the frontend label Oct 29, 2024

mergify bot commented Oct 29, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @pathorn please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 29, 2024