
JSON formatting issue #1191

Closed
rounak610 opened this issue Sep 27, 2023 · 16 comments

Comments

@rounak610

How can I get the response from vLLM in a proper JSON format?
Does vLLM support the outlines, guidance or jsonformer libraries?

@viktor-ferenczi
Contributor

viktor-ferenczi commented Sep 28, 2023

Supporting guided generation would indeed be more efficient, but I don't see support for any of the libraries you mentioned on the roadmap.

Some workarounds until then:

  • Use a coding model if you can, like CodeLlama or WizardCoder, because they are better trained on JSON. Airoboros should also work reasonably well. If you need a smaller one, try microsoft/phi-1_5.
  • Request JSON output in the SYSTEM context or the prompt, for example:
    • For base models, end your prompt with: "The answer in JSON format is the following:"
    • For instruct or chat models: "Answer only in JSON. Do NOT write anything else."
  • Models often like to respond with JSON in a code block (triple back-quotes). It is a good idea to find the first/last occurrence of triple back-quotes and consider only the text in between as the answer. Doing so ignores any explanations the LLM may add. (They don't always follow prompts to the letter.) See the sketch after this list.
  • Validate the model's output (JSON formatting and contents) as much as possible and retry if it is not valid.
  • After multiple unsuccessful retries, consider increasing the temperature slightly.
  • Look for consistent mistakes the model makes in the formatting and implement automatic fixes for them, for example when it misses commas after items or the closing brace.
  • Consider switching to a more text-like or flatter output format, like YAML or TOML. They can also be parsed in Python and converted to JSON if you need the data in that format.
  • Use streaming generation and abort the stream early if it goes off the rails. Retry the prompt while keeping the valid part of the output. This should be efficient thanks to caching in vLLM.
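A minimal sketch of the extract / validate / retry loop described above, assuming a generate(prompt, temperature) callable that wraps whatever vLLM client code you already have:

```python
import json

def extract_json_block(text: str) -> str:
    """Keep only the text between the first and last triple back-quotes, if present."""
    first = text.find("```")
    last = text.rfind("```")
    if first != -1 and last > first:
        text = text[first + 3:last]
        if text.startswith("json"):      # drop an optional ```json language tag
            text = text[len("json"):]
    return text.strip()

def generate_json(generate, prompt: str, max_retries: int = 3, temperature: float = 0.0):
    """Retry until the output parses as JSON, nudging the temperature up after each failure."""
    for _ in range(max_retries):
        raw = generate(prompt, temperature)
        try:
            return json.loads(extract_json_block(raw))
        except json.JSONDecodeError:
            temperature += 0.1
    raise ValueError("model did not produce valid JSON after retries")
```

If you switch to YAML as suggested above, the same loop works with yaml.safe_load (PyYAML) in place of json.loads, followed by json.dumps if you still need JSON downstream.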

@viktor-ferenczi
Contributor

I think the best approach may be to add an API for integrating guided-generation libraries, then provide an adapter for each of the popular libraries separately.

@WoosukKwon What do you think?

@viktor-ferenczi
Contributor

Related to #535

@viktor-ferenczi
Contributor

Mostly the same as #288

@noamgat
Contributor

noamgat commented Nov 8, 2023

LM Format Enforcer is a library that achieves this and supports vLLM.
There is already a sample notebook showing vLLM integration. It currently uses monkeypatching, which will be removed once the next vLLM version with the logits-processing API is released.

(Disclosure: I am the author of the library)
(Later edit - the version was released, and the example no longer uses monkeypatching)

@arshadshk

@noamgat does LM Format Enforcer support API access to vLLM?

@noamgat
Contributor

noamgat commented Nov 22, 2023

@noamgat does LM Format Enforcer support API access to vLLM?

With LM Format Enforcer it's the other way around: it integrates into the inference engine's pipeline. So if you have existing vLLM code, it's easy to plug LMFE into it.

vLLM example notebook
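For reference, a rough sketch of that kind of integration; the helper names are taken from the LMFE vLLM notebook as I remember it, so check the notebook for the exact, current API:

```python
from vllm import LLM, SamplingParams
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.vllm import build_vllm_logits_processor

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

llm = LLM(model="codellama/CodeLlama-7b-Instruct-hf")  # any vLLM-supported model

# Build an LMFE logits processor that constrains generation to the schema
logits_processor = build_vllm_logits_processor(llm, JsonSchemaParser(schema))
params = SamplingParams(max_tokens=200, logits_processors=[logits_processor])

outputs = llm.generate(["Describe a person as a JSON object:"], params)
print(outputs[0].outputs[0].text)
```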

@wdhitchc

wdhitchc commented Dec 7, 2023

@noamgat

Since it's not an integration with the vLLM server, though, I can't just point an existing LLM deployment at the format enforcer; I need to have the model loaded locally, in the same place as the format-enforcer code...

@noamgat
Contributor

noamgat commented Dec 7, 2023

Yes, this is a known issue.
In order to solve it, we would need to be able to pass "logits processor instructions" in the network request to the vLLM server.
I proposed something similar to huggingface-inference-server in this draft PR meant for discussion, but have not gotten a response from the team yet, so I didn't proceed with it.

If vLLM were interested in adopting a similar solution, it would allow LMFE to also work with server / multi-GPU deployments.
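To illustrate the idea, a hypothetical request shape; the json_schema field and the server behaviour below are invented for this example and are not an existing vLLM API:

```python
import requests

# Hypothetical: the server would build the logits processor from "json_schema"
# on its own side, so the client never needs the model or LMFE locally.
response = requests.post(
    "http://localhost:8000/generate",
    json={
        "prompt": "Describe a person as a JSON object:",
        "max_tokens": 200,
        "json_schema": {  # invented field name, not a real vLLM parameter
            "type": "object",
            "properties": {"name": {"type": "string"}},
        },
    },
)
print(response.json())
```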

@viktor-ferenczi
Contributor

That would be nice. Support should be added for LMQL as well, I think.

@wdhitchc

wdhitchc commented Dec 7, 2023

Is that not what's happening here, @noamgat @viktor-ferenczi? #535

Any chance we can get this in?

@noamgat
Contributor

noamgat commented Dec 8, 2023

A similar PR to #535 (disclosure: mine) was already merged. This is how LMFE works with vLLM: example notebook.

The challenge (which #535 does not cover) is how to pass this information to the custom code in a multiprocess / networked environment. A serialization / "simple object" parametrization solution is required, which is what was proposed in the PR to huggingface-inference-server.

@wdhitchc

@noamgat so maybe I'm not super clear on what's going on internally. I have not done my full due diligence and deep-dived into the code, but...

I saw your PR that got in. I thought we just needed another one that utilizes what you put in, inside the API / OpenAI-compatible layer. My understanding was that the final comment on #535 was a request for him to rebase and utilize your code in the API layer.

I was thinking about making a dumb solution for myself where I actually install LMFE into vLLM, then create a new API endpoint that is essentially a parameterized version of the notebook you shared. I was thinking I'd have it inside the OpenAI-compatible server and make it look like function calling. Not sure how efficient it would be, but it could get the job done quickly to accelerate me. This would probably just be a personal fork until the better solution gets in. I don't feel qualified to try and implement the correct pattern, but I'm happy to give it a shot if you'll hold my hand a bit through the process.
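A minimal sketch of what such a personal-fork endpoint could look like, again assuming the LMFE helper names from the notebook; the endpoint name and request shape are made up here, not part of vLLM:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.vllm import build_vllm_logits_processor

app = FastAPI()
llm = LLM(model="codellama/CodeLlama-7b-Instruct-hf")  # loaded once at startup

class GuidedRequest(BaseModel):
    prompt: str
    json_schema: dict        # the caller ships the schema with the request
    max_tokens: int = 200

@app.post("/guided_generate")                          # invented endpoint name
def guided_generate(req: GuidedRequest):
    # Build a fresh LMFE logits processor from the caller's schema per request
    processor = build_vllm_logits_processor(llm, JsonSchemaParser(req.json_schema))
    params = SamplingParams(max_tokens=req.max_tokens, logits_processors=[processor])
    text = llm.generate([req.prompt], params)[0].outputs[0].text
    return {"text": text}
```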

@rlouf

rlouf commented Dec 26, 2023

Outlines now provides a "fork" of vLLM's FastAPI deployment example, if that helps: https://outlines-dev.github.io/outlines/reference/vllm/
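Roughly, that deployment is queried with a prompt plus a JSON schema (a sketch based on the linked docs; the endpoint and field names may differ from the current version):

```python
import requests

# Assumes the Outlines vLLM server from the linked docs is running on port 8000.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
r = requests.post(
    "http://127.0.0.1:8000/generate",
    json={"prompt": "Describe a person as a JSON object:", "schema": schema},
)
print(r.json())
```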

@noamgat
Contributor

noamgat commented Dec 26, 2023

Outlines now provides a "fork" of vLLM's FastAPI deployment example, if that helps: https://outlines-dev.github.io/outlines/reference/vllm/

That is super cool! Congrats on the release!
If you want to avoid monkey-patching the vLLM logits processor API, you can cache according to the generated token tuple instead of seq_id. I'm not sure how seq_id behaves with facilities such as beam search, etc.
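For illustration, a self-contained sketch of that caching idea with a vLLM-style (token_ids, logits) processor; DummyParserState stands in for whatever state the guiding library actually keeps:

```python
import torch

class DummyParserState:
    """Stand-in for the guiding library's parser state."""
    def advance(self, token_id: int) -> "DummyParserState":
        return self                       # a real parser would consume the token here
    def allowed_mask(self, logits: torch.Tensor) -> torch.Tensor:
        return torch.zeros_like(logits)   # a real parser would put -inf on disallowed tokens

_state_cache = {}   # keyed by the tuple of generated token ids, not by seq_id

def guided_logits_processor(token_ids, logits):
    key = tuple(token_ids)
    if key not in _state_cache:
        prev = _state_cache.get(key[:-1]) if key else None
        _state_cache[key] = prev.advance(key[-1]) if prev else DummyParserState()
    return logits + _state_cache[key].allowed_mask(logits)
```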

@hmellor
Collaborator

hmellor commented Mar 28, 2024

Support for guided decoding using outlines was merged in #2819
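With that merged, the OpenAI-compatible server accepts a schema through the guided_json extra parameter, roughly like this (check the current vLLM docs for the exact parameter names and behaviour):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

completion = client.chat.completions.create(
    model="codellama/CodeLlama-7b-Instruct-hf",   # whatever model the server was started with
    messages=[{"role": "user", "content": "Describe a person as a JSON object."}],
    extra_body={"guided_json": schema},           # vLLM-specific guided decoding parameter
)
print(completion.choices[0].message.content)
```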
