Implement structured engine for parsing json grammar by token with response_format: {type: json_object}
#3328
In contrast to the recently added "guided JSON" feature, this introduces an "unguided" JSON mode which runs as a
logits_processor
in constant time per token (see my post about JSON Mode on the DeepInfra blog). The feature is exposed in the OpenAI chat and completions APIs as
"response_format": {"type": "json_object"}
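A request using the new field might look like the sketch below. The server URL, model name, and prompt are placeholders for illustration, not values from this PR; only the `response_format` field matches the OpenAI spec as described above.

```python
# Sketch of an OpenAI-style chat completions request with JSON mode.
# URL and model name are hypothetical placeholders.
import json
import urllib.request

payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",  # placeholder model
    "messages": [
        {"role": "user", "content": "List three primary colors as a JSON object."}
    ],
    # The field this PR adds, matching the OpenAI spec:
    "response_format": {"type": "json_object"},
}

# Build (but do not send) the request against a local vLLM server.
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
```

With the field set, the sampled output is constrained to be valid JSON; without it, the request behaves as before.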
to match the OpenAI spec. A major caveat of the current implementation is that it performs a lot of slow preprocessing in Python for each token in the vocabulary. On a model such as Llama, with 32k tokens in its tokenizer, this adds about 15 seconds to startup time; on Gemma, with its much larger vocabulary (256k tokens), it adds multiple minutes. Because of the slow startup, this feature is disabled by default and must be requested with
--enable-json-mode
on the command line. Despite the slow startup time, I chose this design because it guarantees that runtime execution has (roughly) no overhead (much less than 1% in my testing).
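The precompute-then-mask design can be sketched roughly as follows. This is a toy illustration, not the PR's actual code: the class name, the two-state machine, and the token classification are all made up to show the shape of the idea (pay the per-vocabulary cost once at startup, then do only a mask lookup per decoding step).

```python
# Toy sketch: classify every vocabulary token once at startup, then at
# each decoding step apply the precomputed mask for the parser's current
# state. All names and the two-state "grammar" here are illustrative.
import math

class JsonLogitsProcessor:
    def __init__(self, vocab):
        # Slow, one-time preprocessing: decide per parser state which
        # token ids are legal. Toy rule: "expect_open" allows only
        # tokens that begin a JSON object; "inside" allows everything.
        self.masks = {
            "expect_open": [tok.startswith("{") for tok in vocab],
            "inside": [True] * len(vocab),
        }
        self.state = "expect_open"

    def __call__(self, token_ids, logits):
        # Per decoding step: one table lookup plus a masking pass,
        # with no per-token grammar parsing at runtime.
        mask = self.masks[self.state]
        return [l if ok else -math.inf for l, ok in zip(logits, mask)]

vocab = ["{", '{"', "hello", "}"]
proc = JsonLogitsProcessor(vocab)
out = proc(token_ids=[], logits=[0.0, 0.0, 0.0, 0.0])
# Tokens not starting an object are masked to -inf in the initial state.
```

This mirrors the trade-off described above: the expensive work scales with vocabulary size and happens at startup, while the per-step cost is just applying a cached mask.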
As it is, we are currently running this in production on all of our models.
I have some comments in the code, but there is still a lot of very complex, nuanced code. If anyone would like me to walk through it with them, I'd be happy to add more explanatory comments, or perhaps hop on a Discord call or something and explain parts of it.
Future work: