server: add OpenAI compatible response format for /completions #10627

Nero7991 · 2024-12-02T20:32:16Z

Adds support for the (almost) full OpenAI API response format for the completion-related endpoints, including when logprobs is specified.

The frontend is also modified to support this format as well as the existing format, so it remains functional.

HELM benchmarks from CRFM support an OpenAI-compatible API server, so this change enables testing differently quantized models for degradation against that benchmark. I tested it on a QwQ Preview 32B GGUF (Q4_K_M) to evaluate the model against other frontier models.

This support can be compiled in by using the OAI_FULL_COMPAT preprocessor definition, like so:

Using make:

make CXXFLAGS="-DOAI_FULL_COMPAT" llama-server

When compiled correctly, the startup output of llama-server should include INFO: OpenAI full compatibility mode enabled, as seen in the following snippet:

llama_new_context_with_model:  CUDA_Host compute buffer size =    18.01 MiB
llama_new_context_with_model: graph nodes  = 2246
llama_new_context_with_model: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 4096
INFO: OpenAI full compatibility mode enabled
main: model loaded
main: chat template, built_in: 1, chat_example: '<|im_start|>system

Example:

curl --request POST \
  --url http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt": "Building a website can be done in 10 simple steps:","max_tokens": 8, "logprobs": 2}'

Response:

{
  "id": "cmpl-0",
  "id_slot": 0,
  "index": 0,
  "tokens_predicted": 8,
  "tokens_evaluated": 13,
  "generation_settings": {
    "n_ctx": 4096,
    "n_predict": -1,
    "model": "models/qwq-32b-preview-q4_k_m.gguf",
    "seed": 4294967295,
    "seed_cur": 3068603297,
    "temperature": 0.800000011920929,
    "dynatemp_range": 0,
    "dynatemp_exponent": 1,
    "top_k": 40,
    "top_p": 0.949999988079071,
    "min_p": 0.05000000074505806,
    "xtc_probability": 0,
    "xtc_threshold": 0.10000000149011612,
    "typical_p": 1,
    "repeat_last_n": 64,
    "repeat_penalty": 1,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "dry_multiplier": 0,
    "dry_base": 1.75,
    "dry_allowed_length": 2,
    "dry_penalty_last_n": -1,
    "dry_sequence_breakers": [
      "\n",
      ":",
      "\"",
      "*"
    ],
    "mirostat": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.10000000149011612,
    "penalize_nl": false,
    "stop": [],
    "max_tokens": 8,
    "n_keep": 0,
    "n_discard": 0,
    "ignore_eos": false,
    "stream": false,
    "n_probs": 2,
    "min_keep": 0,
    "grammar": "",
    "samplers": [
      "dry",
      "top_k",
      "typ_p",
      "top_p",
      "min_p",
      "xtc",
      "temperature"
    ],
    "speculative": false,
    "speculative.n_max": 16,
    "speculative.n_min": 5,
    "speculative.p_min": 0.8999999761581421,
    "timings_per_token": false
  },
  "has_new_line": false,
  "truncated": false,
  "stopped_eos": false,
  "stopped_word": false,
  "stopped_limit": true,
  "stopping_word": "",
  "tokens_cached": 20,
  "timings": {
    "prompt_n": 13,
    "prompt_ms": 59.178,
    "prompt_per_token_ms": 4.552153846153846,
    "prompt_per_second": 219.67623103180236,
    "predicted_n": 8,
    "predicted_ms": 186.64,
    "predicted_per_token_ms": 23.33,
    "predicted_per_second": 42.86326618088299
  },
  "object": "text_completion",
  "created": 1733161457,
  "model": "models/qwq-32b-preview-q4_k_m.gguf",
  "choices": [
    {
      "text": " choosing a domain name, registering it,",
      "index": 0,
      "logprobs": {
        "tokens": [
          " choosing",
          " a",
          " domain",
          " name",
          ",",
          " registering",
          " it",
          ","
        ],
        "token_logprobs": [
          -0.8389889001846313,
          -0.03926413506269455,
          -0.09884411841630936,
          -0.04721870273351669,
          0,
          -0.5166370272636414,
          -0.494428426027298,
          0
        ],
        "top_logprobs": [
          {
            " ": -0.8389889001846313,
            " \n": -2.3360304832458496
          },
          {
            " a": -0.03926413506269455,
            " the": -3.2570128440856934
          },
          {
            " domain": -0.09884411841630936,
            " theme": -3.2751708030700684
          },
          {
            " name": -0.04721870273351669,
            ",": -3.076481342315674
          },
          {
            ",": 0
          },
          {
            " selecting": -0.5166370272636414,
            " registering": -1.518433690071106
          },
          {
            " the": -0.494428426027298,
            " a": -1.4887559413909912
          },
          {
            ",": 0
          }
        ],
        "text_offset": [
          0,
          9,
          11,
          18,
          23,
          24,
          36,
          39
        ]
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 13,
    "completion_tokens": 8,
    "total_tokens": 21
  }
}
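
In the logprobs block above, the text_offset values appear to be the running character offset of each token within choices[0].text. A minimal Python sketch that recomputes them from the tokens list (text_offsets is a hypothetical helper name, not something from this PR):

```python
from itertools import accumulate

def text_offsets(tokens):
    """Return the starting character offset of each token in the joined text."""
    lengths = [len(t) for t in tokens]
    # Offset of token i is the total length of tokens 0..i-1.
    return list(accumulate(lengths, initial=0))[:-1]

# Tokens copied from the response above.
tokens = [" choosing", " a", " domain", " name", ",", " registering", " it", ","]
offsets = text_offsets(tokens)
text = "".join(tokens)
print(offsets)  # [0, 9, 11, 18, 23, 24, 36, 39]
```

The result matches the text_offset array in the response, and the joined tokens reproduce the "text" field.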

ngxson · 2024-12-02T20:44:39Z

I think we can just replace the /completion response with the OAI-compat one, instead of hiding it behind -DOAI_FULL_COMPAT. I don't see anyone actually using the non-OAI-compat format, and OAI-compat is pretty much a standard today thanks to its portability. What do you think about this, @ggerganov?

Besides, @Nero7991 you should add a test in test_completion.py to make sure that this works correctly. You can start with from openai import OpenAI and ask copilot to complete the rest.
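
A rough sketch of the kind of shape checks such a test might make, using only the standard library. How the test obtains the response from the server is not shown here; check_completion_response and the hand-trimmed sample dict are assumptions for illustration, not code from this PR:

```python
def check_completion_response(res, n_probs):
    """Raise AssertionError if `res` does not look like an OpenAI text completion."""
    assert res["object"] == "text_completion"
    usage = res["usage"]
    assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]
    for choice in res["choices"]:
        assert choice["finish_reason"] in ("stop", "length")
        lp = choice["logprobs"]
        n = len(lp["tokens"])
        # All per-token arrays must have one entry per generated token.
        assert len(lp["token_logprobs"]) == n
        assert len(lp["top_logprobs"]) == n
        assert len(lp["text_offset"]) == n
        assert "".join(lp["tokens"]) == choice["text"]
        # Log-probabilities are never positive; at most n_probs alternatives.
        assert all(p <= 0 for p in lp["token_logprobs"])
        assert all(1 <= len(top) <= n_probs for top in lp["top_logprobs"])

# Hand-trimmed sample (first two tokens of the response in the PR description).
sample = {
    "object": "text_completion",
    "choices": [{
        "text": " choosing a",
        "index": 0,
        "logprobs": {
            "tokens": [" choosing", " a"],
            "token_logprobs": [-0.8389889001846313, -0.03926413506269455],
            "top_logprobs": [
                {" ": -0.8389889001846313, " \n": -2.3360304832458496},
                {" a": -0.03926413506269455, " the": -3.2570128440856934},
            ],
            "text_offset": [0, 9],
        },
        "finish_reason": "length",
    }],
    "usage": {"prompt_tokens": 13, "completion_tokens": 2, "total_tokens": 15},
}
check_completion_response(sample, n_probs=2)
```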

Nero7991 · 2024-12-03T03:19:11Z

@ngxson So test_completion.py is currently testing for the existing response format. Unless we're switching completely to OpenAI compat, we'd need to figure out what response type the server is compiled for, right? Or should I create a new test_completion_oai_compat.py file?

ngxson

The idea is good overall, but I think this needs a bit more refactoring to make it more "clean".

Also, I don't think we should hide it behind a compiler flag; it's just not convenient for most users. Another idea is to add a specific field to each request, say "oai_compat", and set it to true by default. Users who don't want the OAI response need to explicitly add "oai_compat": false.
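
To make the per-request idea concrete, here is a minimal Python sketch of the selection logic. Only the "oai_compat" field name and its default of true come from the comment above; format_response and the two placeholder formatters are hypothetical:

```python
def format_response(request_body, result, format_oai, format_legacy):
    """Pick the response format per request; OAI-compat unless opted out."""
    # A missing "oai_compat" field means true, as proposed in the review.
    use_oai = bool(request_body.get("oai_compat", True))
    formatter = format_oai if use_oai else format_legacy
    return formatter(result)

# Placeholder formatters standing in for the real response builders.
def as_oai(result):
    return {"object": "text_completion", "choices": [{"text": result}]}

def as_legacy(result):
    return {"content": result}

default_res = format_response({"prompt": "hi"}, "hello", as_oai, as_legacy)
legacy_res = format_response({"prompt": "hi", "oai_compat": False}, "hello", as_oai, as_legacy)
```

With this shape, a single server build can serve both formats, and existing clients only need to add one field to keep the old response.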

I'll propose my approach via another PR

ngxson · 2024-12-03T21:24:59Z

examples/server/server.cpp

                        send_final_response(slot);
+                        #else
+                        send_final_response_oaicompat(slot);


send_final_* is called from the inference thread, but all we're doing here is formatting the response, which should be done at the HTTP layer. I'd suggest moving your code to a new function format_final_*_oaicompat, much like what we have with format_final_response_oaicompat.
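
The separation being asked for can be sketched as a pure formatting function that the HTTP handler calls on a finished result. This is an illustration in Python, not the server's C++ code: format_completion_oaicompat is a hypothetical name, the "content" field of the legacy result is an assumption, and so is the mapping of every non-limit stop to "stop":

```python
import time

def format_completion_oaicompat(result, completion_id, created=None):
    """Build an OpenAI-style text_completion body from a legacy result dict.

    Pure function: it touches no slot or inference state, so it can run in
    the HTTP handler after the inference thread has produced `result`.
    """
    # Assumption: anything other than hitting the token limit maps to "stop".
    finish_reason = "length" if result.get("stopped_limit") else "stop"
    prompt_tokens = result["tokens_evaluated"]
    completion_tokens = result["tokens_predicted"]
    return {
        "id": completion_id,
        "object": "text_completion",
        "created": int(time.time()) if created is None else created,
        "model": result["generation_settings"]["model"],
        "choices": [
            # "content" is assumed to hold the generated text in the legacy format.
            {"text": result["content"], "index": 0, "finish_reason": finish_reason}
        ],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }

legacy = {
    "content": " choosing a domain name, registering it,",
    "tokens_predicted": 8,
    "tokens_evaluated": 13,
    "stopped_limit": True,
    "generation_settings": {"model": "models/qwq-32b-preview-q4_k_m.gguf"},
}
body = format_completion_oaicompat(legacy, "cmpl-0", created=1733161457)
```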

Nero7991 · 2024-12-03

I noticed this yesterday. I found that there's a function called handle_completions_generic (there was a TODO suggesting merging that with handle_chat_completions). I've done that. I'll create another PR with that, since it's probably the right way to do it.

I can probably do the oai_compat set to true by default later and send a PR draft.

ngxson · 2024-12-03T22:51:21Z

Be aware that I'm doing a big refactoring in #10643 to reduce usage of JSON internally. This can introduce quite a lot of conflicts to your code.

github-actions bot added examples server labels Dec 2, 2024

Nero7991 mentioned this pull request Dec 2, 2024

Is there any way to load GGUF models? stanford-crfm/helm#3141

Open

ngxson reviewed Dec 3, 2024

Nero7991 closed this Dec 4, 2024

Nero7991 force-pushed the oai_compat branch from 846b085 to cc98896 Compare December 4, 2024 00:02

Nero7991 mentioned this pull request Dec 4, 2024

server: add OpenAI compatible response format for legacy /completions with b… #10645

Closed
