feat: add token completions endpoint #1565

Open · wants to merge 49 commits into base: main

Conversation

@ishandhanani (Contributor) commented Jun 17, 2025

Summary by CodeRabbit

  • New Features

    • Introduced a new token-level completion API, allowing direct token-based LLM inference with support for streaming and advanced generation parameters.
    • Added new HTTP endpoints for token completions, enabled by default in the service configuration.
    • Expanded model management to support token completion engines, including listing, adding, and removing models.
  • Improvements

    • Enhanced protocol and type definitions to support token completion requests and responses.
    • Made additional traits and functions public for broader integration and extensibility.

coderabbitai bot commented Jun 17, 2025

Walkthrough

Support for a new Dynamo Token Completion API has been added. This includes a new protocol, engine types, model management, and an HTTP endpoint for token-level completions. The implementation integrates with the existing service and model watcher, exposes new configuration options, and makes several traits and functions public for external use.

Changes

  • lib/llm/src/discovery/model_manager.rs: Extended ModelManager to manage token completion engines; added related methods.
  • lib/llm/src/discovery/watcher.rs: Added logic to handle token completion models in model lifecycle management.
  • lib/llm/src/http/service.rs: Added module declaration for token_completions.
  • lib/llm/src/http/service/openai.rs: Made several functions and a struct public; adjusted event filtering based on an environment variable.
  • lib/llm/src/http/service/service_v2.rs: Added enable_dynamo_endpoints config; conditionally registers the token completions endpoint.
  • lib/llm/src/http/service/token_completions.rs: New module: implements the token completions HTTP endpoint and router.
  • lib/llm/src/protocols.rs: Added public module declaration for token_completions.
  • lib/llm/src/protocols/openai.rs: Made traits OpenAISamplingOptionsProvider and OpenAIStopConditionsProvider public.
  • lib/llm/src/protocols/token_completions.rs: New module: defines the Dynamo token completions protocol, request struct, operator, and conversions.
  • lib/llm/src/types.rs: Added token_completions module; exports request/response types and engine aliases.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant HTTP_Service
    participant ModelManager
    participant TokenCompletionEngine

    Client->>HTTP_Service: POST /tokens/completion (DynamoTokenCompletionRequest)
    HTTP_Service->>ModelManager: get_token_completion_engine(model)
    ModelManager-->>HTTP_Service: TokenCompletionStreamingEngine
    HTTP_Service->>TokenCompletionEngine: generate(request)
    TokenCompletionEngine-->>HTTP_Service: Stream<CompletionResponse>
    HTTP_Service-->>Client: JSON CompletionResponse
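For orientation, here is a minimal sketch of what the request body posted to the token completions route might look like. Only model and token_ids are confirmed by the review comments below; the remaining fields, the derive set, and the struct shape are illustrative assumptions, not the PR's actual definition.

use serde::{Deserialize, Serialize};

// Hypothetical sketch of the token completion request; field names beyond
// `model` and `token_ids` are assumptions for illustration only.
#[derive(Debug, Serialize, Deserialize)]
pub struct DynamoTokenCompletionRequest {
    /// Model to run the token-level completion against.
    pub model: String,
    /// Prompt supplied directly as token IDs, bypassing tokenization.
    pub token_ids: Vec<u32>,
    /// Stream responses instead of returning a single JSON body (assumed).
    #[serde(default)]
    pub stream: bool,
    /// Cap on the number of generated tokens (assumed).
    pub max_tokens: Option<u32>,
}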

Poem

In the warren, code hops anew,
Token completions now stream right through!
Engines managed, requests fly—
Dynamo rabbits reaching high.
With endpoints fresh and traits set free,
The meadow grows for all to see.
🐇✨


@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

🔭 Outside diff range comments (1)
lib/llm/src/discovery/model_manager.rs (1)

68-72: has_model_any forgets token-completion models

has_model_any returns true for chat/completions engines only. Calls that rely on it (e.g. health probes, routing decisions) will incorrectly report that a token-completion model is missing.

-        self.chat_completion_engines.read().unwrap().contains(model)
-            || self.completion_engines.read().unwrap().contains(model)
+        self.chat_completion_engines.read().unwrap().contains(model)
+            || self.completion_engines.read().unwrap().contains(model)
+            || self.token_completion_engines.read().unwrap().contains(model)
🧹 Nitpick comments (8)
lib/llm/src/protocols/openai.rs (1)

135-155: Visibility bump – document in the public changelog

Making these traits pub is fine, but it’s a semver-visible surface change.
Consider adding release notes explaining the new public API so external users know it is supported.

lib/llm/src/http/service/service_v2.rs (1)

97-99: Missing doc comment for new config flag

enable_dynamo_endpoints lacks a short doc comment describing what “Dynamo endpoints” are and why one might disable them.
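A sketch of the kind of doc comment meant here; HttpServiceConfig is a placeholder name, not necessarily the real struct:

pub struct HttpServiceConfig {
    /// Register Dynamo-specific routes (e.g. the token completions endpoint)
    /// in addition to the OpenAI-compatible ones. Defaults to true; disable
    /// to expose only the standard OpenAI-style API surface.
    pub enable_dynamo_endpoints: bool,
}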

lib/llm/src/discovery/watcher.rs (2)

284-327: Pipeline construction duplicated – extract helper

The token-completion engine setup repeats ~40 lines already present for chat & text completions.
Factoring a helper like build_backend_pipeline<F, R>(...) would:

• remove copy-paste risk
• centralise future fixes (router-mode branches, KV chooser, etc.)

No functional bug, but maintainability will suffer as each pipeline evolves.
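One possible shape for such a helper, with every crate-specific type elided; this is a sketch of the factoring idea, not the crate's actual pipeline API:

// Shared pipeline scaffolding lives here once; each engine family only
// supplies the closure that builds its specific operator.
fn build_backend_pipeline<C, E, F>(client: C, make_engine: F) -> E
where
    F: FnOnce(C) -> E,
{
    // Router-mode branches, KV chooser wiring, etc. would be handled here
    // so the chat, text, and token pipelines stop duplicating them.
    make_engine(client)
}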


250-252: Unnecessary clone?

client.clone() is used for the completions pipeline, but the original client is later moved into the token-completion router, making the first clone redundant. Dropping it avoids one cheap but unnecessary Arc refcount bump.

lib/llm/src/types.rs (1)

20-34: Avoid super::* – keep the public surface explicit

Using use super::*; inside the nested module brings every item from the parent scope into the module, making it easy to shadow names accidentally and hiding the real dependencies. Prefer importing only what the module actually needs (e.g. Annotated, TokenIdType, protocols) for clearer intent and faster compile times.
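A small self-contained illustration of the point; the module layout and items are invented for the example, with names borrowed from the ones mentioned above:

mod parent {
    pub struct Annotated;
    pub type TokenIdType = u32;
    pub struct Unrelated; // a glob import would silently pull this in too

    pub mod token_completions {
        // Explicit imports document the module's real dependencies and avoid
        // accidental shadowing, unlike `use super::*;`.
        use super::{Annotated, TokenIdType};

        pub fn first_token(_marker: &Annotated, ids: &[TokenIdType]) -> Option<TokenIdType> {
            ids.first().copied()
        }
    }
}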

lib/llm/src/http/service/openai.rs (2)

389-392: Restrict visibility to pub(crate)

check_ready is now pub, but it is still consumed only inside the crate. Exposing it publicly enlarges the API surface and makes breaking-changes harder. Change to pub(crate) (or keep it private) unless there is a concrete cross-crate consumer.


512-515: Cache the env-var lookup

process_event_converter calls std::env::var("DYN_RICH_EVENT_STREAM") for every chunk. This shows up on the hot path when streaming many tokens.

Store the boolean once:

static RICH_STREAM: once_cell::sync::Lazy<bool> =
    once_cell::sync::Lazy::new(|| std::env::var("DYN_RICH_EVENT_STREAM").is_ok());

and replace the runtime check with:

if !*RICH_STREAM && annotated.event.as_deref() == Some(ANNOTATION_LLM_METRICS) { ... }

lib/llm/src/protocols/token_completions.rs (1)

140-147: Validate model field as well

While token_ids is checked for emptiness, an empty model string would propagate downstream and surface as a less actionable backend error. Add an early guard:

if request.model.trim().is_empty() {
    return Err(anyhow::anyhow!("model cannot be empty"));
}
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fbf1ffd and a7bb860.

📒 Files selected for processing (10)
  • lib/llm/src/discovery/model_manager.rs (7 hunks)
  • lib/llm/src/discovery/watcher.rs (4 hunks)
  • lib/llm/src/http/service.rs (1 hunks)
  • lib/llm/src/http/service/openai.rs (4 hunks)
  • lib/llm/src/http/service/service_v2.rs (2 hunks)
  • lib/llm/src/http/service/token_completions.rs (1 hunks)
  • lib/llm/src/protocols.rs (1 hunks)
  • lib/llm/src/protocols/openai.rs (2 hunks)
  • lib/llm/src/protocols/token_completions.rs (1 hunks)
  • lib/llm/src/types.rs (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Rust pre-merge checks
lib/llm/src/protocols/token_completions.rs

[error] 193-193: Clippy lint 'useless_conversion': useless conversion to the same type anyhow::Error. Consider removing Error::from() and use e directly.

🪛 GitHub Actions: NVIDIA Test Github Validation
lib/llm/src/protocols/token_completions.rs

[error] 193-193: Clippy lint 'useless_conversion': useless conversion to the same type anyhow::Error. Consider removing Error::from() and use e directly. (clippy::useless-conversion)

🔇 Additional comments (4)
lib/llm/src/protocols.rs (1)

29-29: Module exposed correctly – no further action required

The public declaration neatly wires the new protocol into the crate.

lib/llm/src/http/service.rs (1)

22-22: Router module added – looks good

mod token_completions; cleanly registers the new HTTP routes.

lib/llm/src/http/service/service_v2.rs (1)

196-201: Route injection mirrors existing pattern – good

Conditional merge follows the same pattern as other endpoint families; no concerns.

lib/llm/src/discovery/watcher.rs (1)

157-158: Ensure remove_token_completion_model is thread-safe

Confirm that the new ModelManager removal method uses the same write-lock strategy as the other removals, so concurrent readers never observe a half-removed entry.
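For reference, a minimal sketch of the locking pattern in question; the field name and engine type are placeholders, not the crate's real definitions:

use std::collections::HashMap;
use std::sync::{Arc, RwLock};

struct ModelManager {
    // Placeholder value type; the real map stores the engine handles.
    token_completion_engines: RwLock<HashMap<String, Arc<()>>>,
}

impl ModelManager {
    fn remove_token_completion_model(&self, model: &str) -> bool {
        // Hold the write lock for the whole mutation, mirroring the other
        // remove_* methods, so readers never see a half-updated map.
        self.token_completion_engines
            .write()
            .unwrap()
            .remove(model)
            .is_some()
    }
}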

Comment on lines +200 to +203
let mut delta_generator = crate::protocols::openai::completions::DeltaGenerator::new(
"model".to_string(),
Default::default(),
);

🛠️ Refactor suggestion

Preserve the requested model name in the response stream

DeltaGenerator::new("model".to_string(), …) hard-codes the model id, losing the real request.model. Downstream consumers (metrics, client libraries) will misidentify the source model.

-        let mut delta_generator = crate::protocols::openai::completions::DeltaGenerator::new(
-            "model".to_string(),
-            Default::default(),
-        );
+        let model_name = token_request.model.clone();
+        let mut delta_generator = crate::protocols::openai::completions::DeltaGenerator::new(
+            model_name,
+            Default::default(),
+        );

Clone before the move so token_request ownership is not lost.

🤖 Prompt for AI Agents
In lib/llm/src/protocols/token_completions.rs around lines 200 to 203, the model
name is hard-coded as "model" when creating DeltaGenerator, which causes loss of
the actual requested model name from token_request. Fix this by cloning the real
model name from token_request before passing it to DeltaGenerator::new,
preserving the original ownership and ensuring downstream consumers receive the
correct model identifier.

@@ -114,6 +122,15 @@ impl ModelManager {
clients.add(model, engine)
}

pub fn add_token_completion_model(
A contributor commented:

Not specific to this MR, but all these functions seem brutally repetitive, especially if we plan on adding more.
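One way that repetition could eventually be collapsed, sketched with placeholder types; the real ModelManager methods differ in signatures and error handling:

use std::collections::HashMap;
use std::sync::RwLock;

// Hypothetical generic registry: add/get/remove/list written once and reused
// for chat, completion, and token-completion engines alike.
struct EngineRegistry<E> {
    engines: RwLock<HashMap<String, E>>,
}

impl<E: Clone> EngineRegistry<E> {
    fn add(&self, model: &str, engine: E) -> bool {
        self.engines
            .write()
            .unwrap()
            .insert(model.to_string(), engine)
            .is_none()
    }

    fn get(&self, model: &str) -> Option<E> {
        self.engines.read().unwrap().get(model).cloned()
    }

    fn remove(&self, model: &str) -> bool {
        self.engines.write().unwrap().remove(model).is_some()
    }

    fn list(&self) -> Vec<String> {
        self.engines.read().unwrap().keys().cloned().collect()
    }
}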

@rmccorm4 (Contributor) left a comment

Left some comments on possibly re-using completions endpoint

Review comment (Contributor):

High-level question: can we just re-use the existing completions endpoint for this functionality?

Rough understanding so far:

Thoughts around using existing completions endpoint:

  • Can we just skip tokenization if we detect prompt is vec and not string?
    • Similar idea to how use_raw_prompt is used in the code to skip chat template
    • We may just be able to detect if Prompt is Prompt::IntegerArray or Prompt::ArrayOfIntegerArray and skip tokenizer.encode() step if so
    • Then we can just directly set builder.token_ids in preprocessor from request.prompt here

Other relevant spots in the code:

Backup Idea if above won't work for some reason:

  • If there are issues with above and we actually need/want a separate engine, what about creating both engines, and dynamically selecting "string" engine or "tokens" engine in completions endpoint here?

Review comment (Contributor):

ex: Rough idea if re-using same engine and just dynamically skipping tokenization / template if prompt is already token ID array:

// preprocessor.rs
// Skip tokenization when the prompt already arrives as token IDs.
let token_ids = if is_token_ids_prompt(&request.prompt) {
    request.prompt
} else {
    tokenizer.encode(&request.prompt)
};

...
builder.token_ids = token_ids;
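And a sketch of the is_token_ids_prompt check used above; the enum here is a local stand-in for the Prompt type from the OpenAI completions spec types, which has the IntegerArray / ArrayOfIntegerArray variants named earlier:

#[allow(dead_code)]
enum Prompt {
    String(String),
    StringArray(Vec<String>),
    IntegerArray(Vec<u32>),
    ArrayOfIntegerArray(Vec<Vec<u32>>),
}

fn is_token_ids_prompt(prompt: &Prompt) -> bool {
    matches!(
        prompt,
        Prompt::IntegerArray(_) | Prompt::ArrayOfIntegerArray(_)
    )
}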

Can try this on client-side today, but it doesn't work as expected because it seems like it casts token ids to a string:

root@ced35d0-lcedt:/workspace# curl localhost:8080/v1/completions   -H "Content-Type: application/json"   -d '{
    "model": "Qwen/Qwen3-0.6B",
    "prompt": [785, 1850, 1616, 311, 8180, 264, 7795, 28041, 374],
    "stream": false,
    "max_tokens": 30 
  }' | jq

{
  "id": "cmpl-ec552f06-914d-4e3f-ae0e-01c7e94708f6",
  "choices": [
    {
      "text": "<think>\nOkay, let's see. I need to figure out the pattern in the sequence: 785 1850 ",
      "index": 0,
      "finish_reason": null
    }
  ],
  "created": 1750270685,
  "model": "Qwen/Qwen3-0.6B",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 29,
    "prompt_tokens": 49,
    "total_tokens": 0
  }
}

@rmccorm4 (Contributor) commented Jun 18, 2025

Hopefully this would reduce necessary changes from ~500 lines of code to <50 and "fix" support for accepting prompt of type IntegerArray/ArrayOfIntegerArray as defined in openai completions spec.

@ishandhanani (Contributor, Author) commented

Discussed this offline. I think this is the way to go. Will work on a refactor
