
Chat completions streaming: TTFT delay with llama-stack-client (TypeScript) vs OpenAI SDK (TypeScript) #53

@KodieGlosserIBM

Description


When streaming chat completions from our server endpoint, we consistently see an additional delay before the first streamed chunk (time to first token, TTFT) when using the LlamaStack TypeScript client, while the OpenAI Node SDK starts streaming almost immediately under the same conditions. The behavior is reproducible on both the latest release and v0.4.0-alpha.7 of llama-stack-client, and the issue appears specific to the LlamaStack TypeScript SDK's streaming path.

Environment

  • Node.js: v22.4.1
  • LlamaStack TypeScript client versions tested:
    • llama-stack-client (latest from npm at time of filing)
    • llama-stack-client@0.4.0-alpha.7
  • OpenAI Node SDK: openai (latest)
  • Request: streaming chat completion (stream: true)
    const stream = await client.chat.completions.create({
        model: model,
        stream: true,
        messages: [
            { role: "system", content: "You are a helpful assistant." },
            { role: "user", content: "Explain SSE streaming in one paragraph." },
        ],
        temperature: 0.7,
    });

The full script is attached below; change the import statement to switch between the openai and llama-stack-client SDKs.

stream-chat.js
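
For reference, here is a minimal sketch of the kind of timing harness that produces the traces below. It is illustrative only: the constructor options, base URL, and fallback model id are assumptions, and the attached stream-chat.js remains the actual reproduction script.

    // ttft-sketch.ts: illustrative TTFT harness, not the attached stream-chat.js
    import LlamaStackClient from "llama-stack-client";
    // import OpenAI from "openai"; // swap the import and constructor below to compare SDKs

    // Constructor options are assumptions; point baseURL at your server.
    const client = new LlamaStackClient({ baseURL: process.env.BASE_URL ?? "http://localhost:8321" });

    async function main() {
        const start = Date.now();
        const stream = await client.chat.completions.create({
            model: process.env.MODEL ?? "example-model", // hypothetical model id
            stream: true,
            messages: [
                { role: "system", content: "You are a helpful assistant." },
                { role: "user", content: "Explain SSE streaming in one paragraph." },
            ],
            temperature: 0.7,
        });

        // Print the elapsed time in front of every chunk; the first line is the TTFT.
        for await (const chunk of stream) {
            const delta = chunk.choices?.[0]?.delta?.content ?? "";
            process.stdout.write(`[+${((Date.now() - start) / 1000).toFixed(3)}s] ${delta}`);
        }
        process.stdout.write("\n");
    }

    main();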

Evidence

The delay is reproducible on every server we tested, including a local instance.

LlamaStack (remote server) — ~2.0s delay before first chunk

HTTP 200
alt-svc: h3=":443"; ma=93600
cache-control: max-age=0, no-cache, no-store
connection: keep-alive, Transfer-Encoding
content-type: text/event-stream;charset=utf-8
date: Tue, 27 Jan 2026 23:56:57 GMT
expires: Tue, 27 Jan 2026 23:56:57 GMT
pragma: no-cache
server-timing: cdn-cache; desc=MISS, edge; dur=1727, origin; dur=188, ak_p; desc="..."
strict-transport-security: max-age=15768000 ; includeSubDomains ; preload
transfer-encoding: chunked
x-correlation-id: 137df151-ee4f-448f-9686-9e8fed4f4c90
x-envoy-upstream-service-time: 158
x-request-id: 548d533c-1d63-4d78-b5a9-48ce7555fb07

[+1.996s] S[+1.998s] SE[+1.998s] ...

OpenAI SDK (remote server) — ~0.38–0.47s delay before first chunk

HTTP 200
alt-svc: h3=":443"; ma=93600
cache-control: max-age=0, no-cache, no-store
connection: keep-alive, Transfer-Encoding
content-type: text/event-stream;charset=utf-8
date: Tue, 27 Jan 2026 23:57:23 GMT
expires: Tue, 27 Jan 2026 23:57:23 GMT
pragma: no-cache
server-timing: cdn-cache; desc=MISS, edge; dur=81, origin; dur=163, ak_p; desc="..."
strict-transport-security: max-age=15768000 ; includeSubDomains ; preload
transfer-encoding: chunked
x-correlation-id: 872d2063-0805-4306-b0e5-679c84e130fd
x-envoy-upstream-service-time: 153
x-request-id: dda88dc6-1ea1-4ad7-8b47-812649bac8be

[+0.376s] S[+0.383s] SE[+0.389s] ...

LlamaStack (hitting localhost) — ~0.66–0.78s TTFT

HTTP 200
cache-control: no-cache
connection: keep-alive
content-type: text/event-stream;charset=utf-8
date: Tue, 27 Jan 2026 23:59:16 GMT
transfer-encoding: chunked
x-correlation-id: 98401572-d04d-4eaf-a6e3-9c6595c533b4
x-request-id: 3b554cfe-2693-4983-baf8-d99181e7391c

[+0.659s] Stream[+0.662s] ing[+0.662s] ...

OpenAI SDK (hitting localhost) — ~0.39s TTFT in the captured trace

HTTP 200
cache-control: no-cache
connection: keep-alive
content-type: text/event-stream;charset=utf-8
date: Tue, 27 Jan 2026 23:59:28 GMT
transfer-encoding: chunked
x-correlation-id: c2d7b76f-b522-42a3-b5b1-ef3b4359a647
x-request-id: 4bee3e1f-5d97-4b79-9299-01a5d26a6d9b

[+0.394s] S[+0.395s] SE[+0.396s]...

Request

Could you help verify whether the stream iterator in llama-stack-client introduces initial buffering before yielding the first data: frame?
If so, could we get a fix (or a configuration option) so the first chunk is emitted as soon as it is received?
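
To illustrate what "emit the first chunk as soon as it is received" would look like, here is a minimal raw SSE reader that bypasses both SDKs and prints each data: frame the moment it arrives. It is a sketch only: the endpoint path, request body, and server URL are assumptions, and this is not the SDK's internal implementation.

    // raw-sse-sketch.ts: times raw data: frames without either SDK (Node 18+ fetch)
    const BASE_URL = process.env.BASE_URL ?? "http://localhost:8321"; // assumed server URL

    async function main() {
        const start = Date.now();
        const res = await fetch(`${BASE_URL}/v1/chat/completions`, { // assumed OpenAI-compatible path
            method: "POST",
            headers: { "Content-Type": "application/json" },
            body: JSON.stringify({
                model: process.env.MODEL,
                stream: true,
                messages: [{ role: "user", content: "Explain SSE streaming in one paragraph." }],
            }),
        });

        const decoder = new TextDecoder();
        let buffer = "";

        // Decode raw bytes and log each complete SSE frame as soon as it is seen,
        // with no buffering beyond the frame boundary itself.
        // Node's web ReadableStream is async-iterable; the cast keeps the DOM types quiet.
        for await (const bytes of res.body as any) {
            buffer += decoder.decode(bytes, { stream: true });
            let sep: number;
            while ((sep = buffer.indexOf("\n\n")) !== -1) {
                const frame = buffer.slice(0, sep);
                buffer = buffer.slice(sep + 2);
                if (frame.startsWith("data:")) {
                    console.log(`[+${((Date.now() - start) / 1000).toFixed(3)}s]`, frame.slice(5, 60).trim());
                }
            }
        }
    }

    main();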
