Description
When streaming chat completions from our server endpoint, we consistently observe an additional delay before the first streamed chunk arrives (i.e., a higher time to first token, TTFT) when using the LlamaStack TypeScript client, while the OpenAI Node SDK begins streaming almost immediately under the same conditions. The behavior is reproducible with both the latest release and v0.4.0-alpha.7 of llama-stack-client, and it appears specific to the LlamaStack TypeScript SDK's streaming path.
Environment
- Node.js: v22.4.1
- LlamaStack TypeScript client versions tested:
  - llama-stack-client (latest from npm at time of filing)
  - llama-stack-client@0.4.0-alpha.7
- OpenAI Node SDK: openai (latest)
- Request: streaming chat completion (stream: true)
const stream = await client.chat.completions.create({
  model: model,
  stream: true,
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Explain SSE streaming in one paragraph." },
  ],
  temperature: 0.7,
});
The full script will be attached; switching the import statement toggles between the OpenAI and LlamaStack clients.
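For reference, a minimal sketch of a timing harness along these lines (not the attached script verbatim; the base URL, model id, and chunk shape below are assumptions for illustration):

// Hypothetical timing harness (sketch only): assumes the default export of
// llama-stack-client and an OpenAI-compatible chunk shape; the base URL and
// MODEL_ID are placeholders. Swap the import for "openai" to compare SDKs.
import LlamaStackClient from "llama-stack-client";

const client = new LlamaStackClient({ baseURL: "http://localhost:8321" });
const t0 = performance.now();

const stream = await client.chat.completions.create({
  model: "MODEL_ID",
  stream: true,
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Explain SSE streaming in one paragraph." },
  ],
  temperature: 0.7,
});

for await (const chunk of stream) {
  // The first annotation printed is the TTFT; later ones show inter-chunk gaps.
  const dt = ((performance.now() - t0) / 1000).toFixed(3);
  process.stdout.write(`[+${dt}s] ${chunk.choices?.[0]?.delta?.content ?? ""}`);
}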
Evidence
Reproducible on all servers we tested, including a local instance.
LlamaStack (remote server) — ~2.0s delay before first chunk
HTTP 200
alt-svc: h3=":443"; ma=93600
cache-control: max-age=0, no-cache, no-store
connection: keep-alive, Transfer-Encoding
content-type: text/event-stream;charset=utf-8
date: Tue, 27 Jan 2026 23:56:57 GMT
expires: Tue, 27 Jan 2026 23:56:57 GMT
pragma: no-cache
server-timing: cdn-cache; desc=MISS, edge; dur=1727, origin; dur=188, ak_p; desc="..."
strict-transport-security: max-age=15768000 ; includeSubDomains ; preload
transfer-encoding: chunked
x-correlation-id: 137df151-ee4f-448f-9686-9e8fed4f4c90
x-envoy-upstream-service-time: 158
x-request-id: 548d533c-1d63-4d78-b5a9-48ce7555fb07
[+1.996s] S[+1.998s] SE[+1.998s] ...
OpenAI SDK (remote server) — ~0.38–0.47s delay before first chunk
HTTP 200
alt-svc: h3=":443"; ma=93600
cache-control: max-age=0, no-cache, no-store
connection: keep-alive, Transfer-Encoding
content-type: text/event-stream;charset=utf-8
date: Tue, 27 Jan 2026 23:57:23 GMT
expires: Tue, 27 Jan 2026 23:57:23 GMT
pragma: no-cache
server-timing: cdn-cache; desc=MISS, edge; dur=81, origin; dur=163, ak_p; desc="..."
strict-transport-security: max-age=15768000 ; includeSubDomains ; preload
transfer-encoding: chunked
x-correlation-id: 872d2063-0805-4306-b0e5-679c84e130fd
x-envoy-upstream-service-time: 153
x-request-id: dda88dc6-1ea1-4ad7-8b47-812649bac8be
[+0.376s] S[+0.383s] SE[+0.389s] ...
LlamaStack (hitting localhost) — ~0.66–0.78s TTFT
HTTP 200
cache-control: no-cache
connection: keep-alive
content-type: text/event-stream;charset=utf-8
date: Tue, 27 Jan 2026 23:59:16 GMT
transfer-encoding: chunked
x-correlation-id: 98401572-d04d-4eaf-a6e3-9c6595c533b4
x-request-id: 3b554cfe-2693-4983-baf8-d99181e7391c
[+0.659s] Stream[+0.662s] ing[+0.662s] ...
OpenAI SDK (hitting localhost) — ~0.39s TTFT
HTTP 200
cache-control: no-cache
connection: keep-alive
content-type: text/event-stream;charset=utf-8
date: Tue, 27 Jan 2026 23:59:28 GMT
transfer-encoding: chunked
x-correlation-id: c2d7b76f-b522-42a3-b5b1-ef3b4359a647
x-request-id: 4bee3e1f-5d97-4b79-9299-01a5d26a6d9b
[+0.394s] S[+0.395s] SE[+0.396s]...
Request
Could you help verify whether the stream iterator in llama-stack-client introduces initial buffering before yielding the first data: frame?
If so, could we get a fix (or a configuration option) so that the first chunk is emitted as soon as it is received?
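To help isolate where the buffering occurs, a raw-SSE check along these lines could be run against the same server, bypassing the SDK iterator entirely and logging when the first bytes of the event stream actually arrive (sketch only; the endpoint path, model id, and lack of auth headers are assumptions that depend on the deployment):

// Raw-SSE timing check (sketch): measures when the first bytes of the
// text/event-stream response arrive, independent of any SDK parsing.
// The endpoint path and model id are placeholders; adjust as needed.
const t0 = performance.now();
const res = await fetch("http://localhost:8321/v1/chat/completions", {
  method: "POST",
  headers: { "content-type": "application/json", accept: "text/event-stream" },
  body: JSON.stringify({
    model: "MODEL_ID",
    stream: true,
    messages: [{ role: "user", content: "Explain SSE streaming in one paragraph." }],
  }),
});

const reader = res.body!.getReader();
const decoder = new TextDecoder();
let first = true;
for (;;) {
  const { value, done } = await reader.read();
  if (done) break;
  if (first) {
    // If the first bytes arrive quickly here but the SDK iterator yields late,
    // the extra delay is introduced client-side rather than on the wire.
    console.log(`first bytes after +${((performance.now() - t0) / 1000).toFixed(3)}s`);
    first = false;
  }
  decoder.decode(value, { stream: true }); // data: frames could be parsed/printed here
}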