Problems with the response of the OpenAI-Compatible Frontend for Triton Inference Server #7796

Open
@DimadonDL

Hi,

I have installed Triton with the vLLM backend and the OpenAI-Compatible Frontend for Triton Inference Server (Beta). The model is meta-llama/Llama-3.1-8B-Instruct. When I call the endpoint, for example like this:

MODEL="llama-3.1-8b-instruct"
curl -s http://localhost:9000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "'${MODEL}'",
  "messages": [{"role": "user", "content": "Why is the sky blue?"}]
}' | jq

The response is:

{
  "id": "cmpl-276d1a84-a293-11ef-b088-d404e69cb4ea",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The sky appears blue because of a phenomenon called Rayleigh scattering, named after the",
        "tool_calls": null,
        "role": "assistant",
        "function_call": null
      },
      "logprobs": null
    }
  ],
  "created": 1731593837,
  "model": "vllm_model",
  "system_fingerprint": null,
  "object": "chat.completion",
  "usage": null
}

As you can see, the content is cut off. I have experimented with the config, but I can't find the cause. With the Python client the response is complete:

from openai import OpenAI

client = OpenAI(
    base_url="http://192.168.175.242:9000/v1",
    api_key="EMPTY",
)

model = "vllm_model"
completion = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant.",
        },
        {"role": "user", "content": "Why is the sky blue?"},
    ],
    max_tokens=4096,
)

print(completion.choices[0].message.content)

The response here is:

The sky appears blue due to a phenomenon called Rayleigh scattering. This is a scientific explanation:

  1. Sunlight and the Atmosphere: When sunlight enters the Earth's atmosphere, it encounters tiny molecules of gases such as nitrogen (N2) and oxygen (O2). These molecules are much smaller than the wavelength of light.

  2. Scattering of Light: According to the Rayleigh scattering theory, when light travels through the atmosphere, it encounters these tiny molecules. The shorter (blue) wavelengths of light are scattered more than the longer (red) wavelengths. This scattering of light in all directions is what gives the sky its blue color.

  3. Blue Light Dominates: Due to the scattering effect, the blue light is distributed throughout the atmosphere, reaching our eyes from all directions. As a result, the sky appears blue. This is why we see a blue sky during the daytime.

  4. Time of Day and Atmospheric Conditions: The color of the sky can change depending on the time of day and atmospheric conditions. During sunrise and sunset, the light has to travel longer distances through the atmosphere, which scatters the shorter wavelengths even more, making the sky appear red or orange. On a cloudy day, the scattered light is blocked, making the sky appear gray or white.

In summary, the sky appears blue due to the scattering of sunlight by the tiny molecules in the atmosphere, with blue light being scattered more than other colors.
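One visible difference between the two calls is that the Python request passes max_tokens=4096 (and a system message), while the curl request sets no max_tokens at all. Whether that accounts for the truncation is an assumption, not something confirmed here. A minimal sketch for checking it: send the same prompt to the same endpoint with and without max_tokens and compare the lengths. The helper names (build_payload, chat, compare) are hypothetical, and the base URL and the model name "vllm_model" are taken from the examples above.

```python
import json
import urllib.request

# Assumption: same endpoint as in the curl example above.
BASE_URL = "http://localhost:9000/v1"


def build_payload(model, prompt, max_tokens=None):
    """Build a chat-completions request body; max_tokens is included only if given."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    return payload


def chat(payload, base_url=BASE_URL):
    """POST the payload to /chat/completions and return the assistant text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


def compare(model, prompt, base_url=BASE_URL):
    """Print the response length without and with an explicit max_tokens."""
    for max_tokens in (None, 4096):
        text = chat(build_payload(model, prompt, max_tokens), base_url)
        print(f"max_tokens={max_tokens}: {len(text)} characters")
```

With the server running, compare("vllm_model", "Why is the sky blue?") prints both lengths. If only the request without max_tokens comes back short, the difference lies in the request parameters rather than in model.json.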

My model.json is:

{
    "model":"meta-llama/Llama-3.1-8B-Instruct",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.9,
    "enforce_eager": true,
    "tensor_parallel_size": 4,
    "max_model_len": 50000
}

I have the same problem with max_model_len set to 4096.

It would be great if someone could help me here.

Hardware: 4 NVIDIA L4 GPUs with 96 GB VRAM.

Thanks 👍

Labels: module: frontends (issues related to the Triton frontends)